Abstract



The statistical query (SQ) learning model of Kearns is a natural restriction of the PAC learning model
in which a learning algorithm is allowed to obtain estimates of statistical properties of the examples
but cannot see the examples themselves [23]. We describe a new and simple characterization of the
query complexity of learning in the SQ learning model. Unlike the previously known bounds on SQ
learning [8, 10, 36, 3, 32], our characterization preserves the accuracy and the efficiency of learning.
The preservation of accuracy implies that our characterization gives the first characterization of
SQ learning in the agnostic learning framework of Haussler and Kearns, Schapire and Sellie [19, 25].
The preservation of efficiency is achieved using a new boosting technique and allows us to derive a
new approach to the design of evolutionary algorithms in Valiant's model of evolvability [35]. We
use this approach to demonstrate the existence of a large class of monotone evolutionary learning 
algorithms based on square loss fitness estimation. These results differ significantly from the few 
known evolutionary algorithms and give evidence that evolvability in Valiant's model is a more 
versatile phenomenon than there had been previous reason to suspect. 



1 Introduction 



We study the complexity of learning in Kearns' well-known statistical query (SQ) learning model
[23]. Statistical query learning is a natural restriction of the PAC learning model in which a learning
algorithm is allowed to obtain estimates of statistical properties of the examples but cannot see the
examples themselves. Formally, the learning algorithm is given access to STAT($f, D$), a statistical
query oracle for the unknown target function $f$ and distribution $D$ over some domain $X$. A query to
this oracle is a function of an example $\phi : X \times \{-1,1\} \to \{-1,1\}$. The oracle may respond to the
query with any value $v$ satisfying $|\mathbf{E}_{x \sim D}[\phi(x, f(x))] - v| \le \tau$, where $\tau \in [0,1]$ is the tolerance of the
query.

Kearns demonstrated that any learning algorithm that is based on statistical queries can be
automatically converted to a learning algorithm robust to random classification noise of arbitrary rate
smaller than the information-theoretic barrier of 1/2. Most known learning algorithms can be
converted to statistical query algorithms and hence the SQ model proved to be a powerful technique for
the design of noise-tolerant learning algorithms (e.g. [23]). In fact, since the introduction of
the model virtually all¹ known noise-tolerant learning algorithms were obtained from SQ algorithms.
The basic approach was also extended to deal with noise in numerous other learning scenarios and
has also found applications in other areas [30, 6, 22]. This makes the study of the complexity of SQ
learning crucial for the understanding of noise-tolerant learning and PAC learning in general.

* Earlier version of this work appeared in the proceedings of the 44th IEEE Symposium on Foundations of Computer
Science, 2009.

Kearns has also demonstrated that there are information-theoretic impediments unique to SQ 
learning: parity functions require an exponential number of SQs to be learned [23]. Further, Blum et
al. proved that the number of SQs required for weak learning (that is, learning that gives a non-negligible
advantage over random guessing) of a concept class $C$ is characterized by a relatively simple
combinatorial parameter of $C$ called the statistical query dimension SQ-DIM($C, D$) [8]. SQ-DIM($C, D$)
measures the maximum number of "nearly uncorrelated" (relative to distribution $D$) functions in
$C$. Bshouty and Feldman gave an alternative way to characterize weak learning by statistical query
algorithms that is based on the number of functions required to weakly approximate each function
in $C$ [10]. These bounds for weak learning were strengthened and extended to other variants of
statistical queries in several works [36, 14]. Notable applications of these bounds are lower bounds
on SQ-DIM of several concept classes by Klivans and Sherstov [28] and an upper bound on the SQ
dimension of halfspaces by Sherstov [31].

While the query complexity of weak SQ learning is fairly well-studied, few works have addressed 
the query complexity of strong SQ learning. It is easy to see that there exist classes of functions for 
which strong SQ complexity is exponentially higher than the weak SQ complexity. One such example 
is learning of monotone functions with respect to the uniform distribution. The complexity of weak 
SQ learning and hence the statistical query dimension are polynomial [24, 11]. However, strong
PAC learning of monotone functions with respect to the uniform distribution requires an exponential
number of examples and hence an exponential number of statistical queries [24, 5]. In addition, it is
important to note that the statistical query dimension and other known notions of statistical query
complexity are distribution-specific and therefore one cannot directly invoke the equivalence of weak
and strong SQ learning in the distribution-independent setting [2]. The first explicit² characterization
of strong SQ learning with respect to a fixed distribution $D$ was only recently derived by Simon [32].

1.1 Our Results 

Our main result is a complete characterization of the query complexity of SQ learning in both
PAC and agnostic models. Informally, our characterization states that a concept class $C$ is SQ
learnable over a distribution $D$ if and only if for every real-valued function $\psi$, there exists a small
(i.e. polynomial-size) set of functions $G_\psi$ such that for every $f \in C$, if $\mathrm{sign}(\psi)$ is not "close"
to $f$ then one of the functions in $G_\psi$ is "noticeably" correlated with $f - \psi$. More formally, for a
distribution $D$ over $X$, we define the (semi-)inner product over the space of real-valued functions on
$X$ as $\langle \phi, \psi \rangle_D = \mathbf{E}_{x \sim D}[\phi(x) \cdot \psi(x)]$. Then $C$ is SQ learnable to accuracy $\epsilon$ if and only if for every
$\psi : X \to [-1,1]$, there exists a set of functions $G_\psi$ such that (1) for every $f \in C$, if $\Pr_D[\mathrm{sign}(\psi) \ne f] > \epsilon$
then $|\langle g, f - \psi \rangle_D| \ge \gamma$ for some $g \in G_\psi$; (2) $|G_\psi|$ is polynomial and $\gamma > 0$ is inverse-polynomial
in $1/\epsilon$ and $n$ (the size of the learning problem). It is known [10] that the number of
functions required to weakly approximate every function in a set of functions $F$ is precisely the (weak)
statistical query dimension of $F$ (after the appropriate generalization of the notion to any set of
real-valued functions). Therefore, our approximation-based characterization leads to a characterization
based on the following orthogonality-based dimension:

$$\text{SQ-SDIM}(C, D, \epsilon) = \sup_{\psi} \left\{ \text{SQ-DIM}\big((C \setminus B_D(\mathrm{sign}(\psi), \epsilon)) - \psi,\; D\big) \right\},$$

where $B_D(\mathrm{sign}(\psi), \epsilon)$ is the set of functions that differ from $\mathrm{sign}(\psi)$ on
at most an $\epsilon$ fraction of $X$ and $F - \psi = \{f - \psi \mid f \in F\}$.

¹ A notable exception is the algorithm for learning parities of Blum et al. [9], which is tolerant to random noise, albeit
not in the same strong sense as the algorithms derived from SQs.

² An earlier work has also considered this question but the characterization that was obtained is in terms of
query-answering protocols that are essentially specifications of non-adaptive algorithms [3].

An important property of both of these characterizations is that the accuracy parameter in 
the dimension corresponds to the accuracy parameter $\epsilon$ of learning (up to the tolerance of the SQ
learning algorithm). The advantage of the approximation-based characterization is that it preserves
computational efficiency of learning. Namely, the set of approximating functions for $\epsilon$-accurate
learning can be computed efficiently if and only if there exists an efficient SQ learning algorithm
achieving error of at most $\epsilon$. The orthogonality-based characterization does not preserve efficiency
but is easier to analyze when proving lower bounds. Neither of these properties is possessed by
the previous characterizations of strong SQ learning [3, 32, 33].

The preservation of accuracy implies that both of our characterizations can be naturally extended 
to agnostic learning by replacing the concept class $C$ with the set of all functions that are $\Delta$-close
to at least one concept in $C$ (see Th. 4.1). Learning in this model is notoriously hard and this is
readily confirmed by the SQ dimension we introduce. For example, in Theorem 4.6 we prove that the
SQ dimension of agnostic learning of monotone disjunctions with respect to the uniform distribution
is superpolynomial. This provides new evidence that agnostic learning of conjunctions is a hard
problem even when restricted to the monotone case over the uniform distribution. The preservation
of accuracy is critical for the generalization to agnostic learning since, unlike in the PAC model,
achieving, for example, twice the error (i.e. $2 \cdot \Delta$) might be a substantially easier task than learning
to accuracy $\Delta + \epsilon$.

We note that the characterization of (strong) SQ learning by Simon [32] has some similarity to 
ours. It also examines weak statistical query dimension of $F - \psi$ for $F \subseteq C$ and some function $\psi$.
However, the maximization is over all sets of functions $F$ satisfying several properties and $\psi$ is fixed to
be the average of functions in $F$. Simon's SQ dimension and the characterization were substantially
simplified in a very recent and independent work of Szörényi [33]. As was shown by Szörényi, his
dimension can be easily related to the dimension we use in our second characterization (Th. 3.11).
Szörényi's result is based on a very different technique and does not preserve efficiency and
accuracy.

1.2 Overview of the Proof 

To prove the first direction of our characterization we simulate the SQ learning algorithm for $C$ while
replying to its statistical queries using $\psi$ in place of the unknown target function $f$. If $\psi$ is not close
to $f$ then one of the queries in this execution has to distinguish between $f$ and $\psi$, giving a function
that weakly approximates $f - \psi$. Hence the polynomial number of queries in this execution implies
the existence of the set $G_\psi$ with the desired property.

For the second direction we use the fact that $\langle g, f - \psi \rangle_D \ge \gamma$ means that $g$ "points" in the direction
of $f$ from $\psi$, that is, $\psi + \gamma \cdot g$ is closer to $f$ than $\psi$ by at least $\gamma^2$ in the squared norm corresponding to our
inner product. Therefore one can "learn" the target function $f$ by taking steps in the direction of $f$
until the hypothesis converges to $f$. This argument requires the hypothesis at each step to have range
in $[-1, 1]$ and therefore we apply a projection step after each update. This process is closely related
to projected gradient descent, a well-known technique in a number of areas. The closest analogues
of this technique in learning are some boosting algorithms (e.g. [4]). In particular, our algorithm is
closely related to the hard-core set construction of Impagliazzo [20] adapted to boosting by Klivans
and Servedio [27]. The proof of our result can also be seen as a new type of boosting algorithm that
instead of using a weak learning algorithm on different distributions uses a weak learning algorithm
on different target functions (namely $f - \psi$). This connection is explored in [17].

1.3 Applications to Evolvability

The characterization and its efficiency-preserving proofs imply that if C is SQ learnable then for 
every hypothesis function $\psi$, there exists a small and efficiently computable set of functions $N(\psi)$
such that if $\psi$ is not "close" to $f \in C$ then one of the functions in $N(\psi)$ is "closer" to $f$ than $\psi$
(Th. 5.4). This property implies that every SQ learnable $C$ is learnable by a canonical learning algorithm
which learns C via a sequential process in which at every step the best hypothesis is chosen from a 
small and fixed pool of hypotheses "adjacent" to the current hypothesis. 

This type of learning has been recently proposed by Valiant as one that can explain the acquisition
of complex functionality by living organisms through the process of evolution guided by natural
selection [35]. One particularly important issue addressed by the model is the ability of an evolutionary
algorithm to adjust to a change of the target function without sacrificing the fitness of the current
hypothesis (beyond the decrease caused by the change itself). Existence of algorithms that are robust
to such changes (we refer to them as monotone) could explain the ability of some organisms to
adapt to changes in environmental conditions without the need for a "restart". While the power of
non-monotone evolvability was resolved in our recent work [14, 16], very few examples of monotone
evolutionary algorithms are known. Michael's algorithm for evolving decision lists with respect to the
uniform distribution [29] and the distribution-independent algorithm for evolving singletons (functions
that are positive on a single point) in [16] are the only examples we are aware of. Our canonical
learning algorithms can be fairly easily translated into evolutionary algorithms, demonstrating that
every concept class $C$ that is SQ learnable with respect to a distribution $D$ is evolvable monotonically over
$D$ (Th. 5.3).

While we do not know how to extend this general method to the more robust distribution-independent
evolvability, we show that the underlying ideas can be useful for this purpose as well.
Namely, we prove distribution-independent and monotone evolvability of Boolean disjunctions (or
conjunctions) using a simple and natural mutation algorithm (Th. 5.7). The mutation algorithm is
based on slight adjustments of the contribution of each of the Boolean variables while bounding the
total value of contributions (which corresponds to the projection step). Both of these results are
based on measuring fitness of a hypothesis using the quadratic loss function. Formal definitions of
the model and the results are given in Section 5.

1.4 Relation to the Earlier Version 

Since the appearance of the earlier version of this work [15] we have found ways to strengthen some 
of the parameters of the characterizations. As a result the dimensions used here differ from the ones 
introduced in [15]. Also, unlike the dimension we use here, the SQD$_\epsilon$ dimension in [15] preserves
the output hypothesis space and hence is suitable for characterizing proper learning. To emphasize 
the difference we use different notation for the dimensions defined in the two versions of the work. 
In addition, the characterization of learning in the agnostic model is now simplified using recent 
distribution-specific agnostic boosting algorithms [17].

2 Preliminaries 

For a positive integer $\ell$, let $[\ell]$ denote the set $\{1, 2, \ldots, \ell\}$. We denote the domain of our learning
problems by $X$ and let $\mathcal{F}_1^\infty$ denote the set of all functions from $X$ to $[-1,1]$ (that is, all the functions
with $L_\infty$ norm bounded by 1). It will be convenient to view a distribution $D$ over $X$ as defining
the product $\langle \phi, \psi \rangle_D = \mathbf{E}_{x \sim D}[\phi(x) \cdot \psi(x)]$ over the space of real-valued functions on $X$. It is easy
to see that this is simply a non-negatively weighted version of the standard dot product over $\mathbb{R}^X$
and hence is a positive semi-inner product over $\mathbb{R}^X$. The corresponding norm is defined as
$\|\phi\|_D = \sqrt{\mathbf{E}_D[\phi^2(x)]} = \sqrt{\langle \phi, \phi \rangle_D}$. We define an $\epsilon$-ball around a Boolean function $h$ as
$B_D(h, \epsilon) = \{g : X \to \{-1,1\} \mid \Pr_D[h \ne g] \le \epsilon\}$. For two real-valued functions $\phi$ and $\psi$ we let
$L_1^D(\phi, \psi) = \mathbf{E}_D[|\phi(x) - \psi(x)|]$.
For a set of real-valued functions $F$ and a real-valued function $\psi$ we denote by $F - \psi = \{f - \psi \mid f \in F\}$.
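
To make the notation above concrete, here is a minimal Python sketch for a finite domain. The function names (`inner`, `norm`, `l1_dist`, `in_ball`) and the dictionary representation of $D$ are assumptions made for this illustration only; they do not come from the paper.

```python
import math

def inner(phi, psi, D):
    """<phi, psi>_D = E_{x~D}[phi(x) * psi(x)]; D maps each point to its probability."""
    return sum(p * phi(x) * psi(x) for x, p in D.items())

def norm(phi, D):
    """||phi||_D = sqrt(<phi, phi>_D)."""
    return math.sqrt(inner(phi, phi, D))

def l1_dist(phi, psi, D):
    """L_1^D(phi, psi) = E_D[|phi(x) - psi(x)|]."""
    return sum(p * abs(phi(x) - psi(x)) for x, p in D.items())

def in_ball(h, g, D, eps):
    """True if g lies in B_D(h, eps), i.e. Pr_D[h != g] <= eps."""
    return sum(p for x, p in D.items() if h(x) != g(x)) <= eps

# Illustrative data: uniform distribution over {0,1}^2 and two +/-1-valued functions.
D = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
f = lambda x: 1 if x[0] == 1 else -1   # depends on the first bit only
g = lambda x: 1 if x[1] == 1 else -1   # depends on the second bit only
```

In this toy example `f` and `g` are uncorrelated under `D`, each has norm 1, and they disagree on half of the domain.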

2.1 PAC Learning 

For a domain $X$, a concept class over $X$ is a set of $\{-1,1\}$-valued functions over $X$ referred to as
concepts. A concept class together with a specific way to represent all the functions in the concept 
class is referred to as a representation class. For brevity, we often refer to a representation class as 
just a concept class with some implicit representation scheme. 

There is often a complexity parameter n associated with the domain X and the concept class C 
such as the number of Boolean variables describing an element in X or the number of real dimensions. 
In such a case it is understood that $X = \bigcup_{n \ge 1} X_n$ and $C = \bigcup_{n \ge 1} C_n$. We drop the subscript $n$ when it
is clear from the context. In some cases it is useful to consider another complexity parameter associated
with $C$: the minimum description length of $f$ under the representation scheme of $C$. Here, for brevity,
we assume that $n$ (or a fixed polynomial in $n$) bounds the description length of all functions in $C_n$.

The models we consider are based on the well-known PAC learning model introduced by Valiant 
[34]. Let $C$ be a representation class over $X$. In the basic PAC model a learning algorithm is
given examples of an unknown function $f$ from $C$ on points randomly chosen from some unknown
distribution $D$ over $X$ and should produce a hypothesis $h$ that approximates $f$. Formally, an example
oracle EX($f, D$) is an oracle that upon being invoked returns an example $\langle x, f(x) \rangle$, where $x$ is chosen
randomly with respect to $D$, independently of any previous examples.

An algorithm is said to PAC learn $C$ in time $t$ if for every $\epsilon > 0$, $\delta > 0$, $f \in C$, and distribution $D$
over $X$, the algorithm given $\epsilon$ and access to EX($f, D$) outputs, in time $t$ and with probability at least
2/3, a hypothesis $h$ that is evaluatable in time $t$ and satisfies $\Pr_D[f(x) \ne h(x)] \le \epsilon$. For convenience
we also allow real-valued hypotheses in $\mathcal{F}_1^\infty$. Such a hypothesis needs to satisfy $\langle f, h \rangle_D \ge 1 - 2\epsilon$.
A real-valued hypothesis $\phi(x)$ can also be thought of as a randomized Boolean hypothesis $\Phi(x)$ such
that $\phi(x)$ equals the expected value of $\Phi(x)$. Hence $\langle f, \phi \rangle_D \ge 1 - 2\epsilon$ is equivalent to saying
that the expected error of $\Phi(x)$ is at most $\epsilon$. We say that an algorithm efficiently learns $C$ when $t$ is
upper bounded by a polynomial in $n$ and $1/\epsilon$.

The basic PAC model is also referred to as distribution-independent learning to distinguish it
from distribution-specific PAC learning in which the learning algorithm is required to learn only with
respect to a single distribution $D$ known in advance.

A weak learning algorithm [26] is a learning algorithm that produces a hypothesis whose disagreement
with the target concept is noticeably less than 1/2 (and not necessarily any $\epsilon > 0$). More
precisely, a weak learning algorithm produces a hypothesis $h \in \mathcal{F}_1^\infty$ such that $\langle f, h \rangle_D \ge 1/p(n)$
for some fixed polynomial $p$.

2.2 Agnostic Learning 

The agnostic learning model was introduced by Haussler [19] and Kearns et al. [25] in order to model 
situations in which the assumption that examples are labeled by some $f \in C$ does not hold. In the
most general version of the model the examples are generated from some unknown distribution $A$
over $X \times \{-1,1\}$. The goal of an agnostic learning algorithm for a concept class $C$ is to produce
a hypothesis whose error on examples generated from $A$ is close to the best possible by a concept
from $C$. Any distribution $A$ over $X \times \{-1,1\}$ can be described uniquely by its marginal distribution
$D$ over $X$ and the expectation of the label $b$ given $x$. That is, we refer to a distribution $A$ over
$X \times \{-1,1\}$ by a pair $(D_A, \phi_A)$ where $D_A(z) = \Pr_{\langle x,b \rangle \sim A}[x = z]$ and

$$\phi_A(z) = \mathbf{E}_{\langle x,b \rangle \sim A}[b \mid z = x].$$

Formally, for a function $h \in \mathcal{F}_1^\infty$ and a distribution $A = (D, \phi)$ over $X \times \{-1,1\}$, we define

$$\Delta(A, h) = L_1^D(\phi, h)/2.$$

Note that for a Boolean function $h$, $\Delta(A, h)$ is exactly the error of $h$ in predicting an example drawn
randomly from $A$, or $\Pr_{\langle x,b \rangle \sim A}[h(x) \ne b]$. For a concept class $C$, let $\Delta(A, C) = \inf_{h \in C}\{\Delta(A, h)\}$.
Kearns et al. [25] define agnostic learning as follows.

Definition 2.1 An algorithm $\mathcal{A}$ agnostically learns a representation class $C$ if for every $\epsilon > 0$, $\delta > 0$,
distribution $A$ over $X \times \{-1,1\}$, $\mathcal{A}$ given access to examples drawn randomly from $A$, outputs, with
probability at least 2/3, a hypothesis $h \in \mathcal{F}_1^\infty$ such that $\Delta(A, h) \le \Delta(A, C) + \epsilon$.

As in PAC learning, the learning algorithm is efficient if it runs in time polynomial in $1/\epsilon$ and $n$.

More generally, for $0 \le \alpha \le \beta \le 1/2$ an $(\alpha, \beta)$-agnostic learning algorithm is an algorithm that
produces a hypothesis $h$ such that $\Delta(A, h) \le \beta$ whenever $\Delta(A, C) \le \alpha$. In the distribution-specific
version of this model, learning is only required for every $A = (D, \phi)$, where $D$ equals some fixed
distribution known in advance.
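
The claim that $\Delta(A, h)$ equals the prediction error of a Boolean $h$ can be checked numerically. The following Python sketch does so on a three-point domain; the helper names `delta` and `error_prob` and the toy distribution are assumptions made for this example. It uses the fact that, for labels in $\{-1,1\}$, $\Pr[b = 1 \mid x] = (1 + \phi(x))/2$.

```python
def delta(D, phi, h):
    """Delta(A, h) = L_1^D(phi, h) / 2 for A = (D, phi) over a finite domain."""
    return sum(p * abs(phi(x) - h(x)) for x, p in D.items()) / 2

def error_prob(D, phi, h):
    """Pr_{(x,b)~A}[h(x) != b], computed from Pr[b = 1 | x] = (1 + phi(x)) / 2."""
    total = 0.0
    for x, p in D.items():
        p_plus = (1 + phi(x)) / 2
        # h errs when the label is -1 and h(x) = 1, or the label is 1 and h(x) = -1.
        total += p * ((1 - p_plus) if h(x) == 1 else p_plus)
    return total

# Illustrative distribution A = (D, phi) and Boolean hypothesis h.
D = {0: 0.5, 1: 0.25, 2: 0.25}
phi = {0: 0.5, 1: -1.0, 2: 0.0}.get    # phi(x) = E[b | x]
h = lambda x: 1 if x < 2 else -1
```

Pointwise, the identity holds because $|\phi(x) - h(x)|/2 = (1 - h(x)\phi(x))/2$ whenever $h(x) \in \{-1,1\}$ and $\phi(x) \in [-1,1]$.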

2.3 The Statistical Query Learning Model 

In the statistical query model of Kearns [23] the learning algorithm is given access to STAT($f, D$),
a statistical query oracle for target concept $f$ with respect to distribution $D$, instead of EX($f, D$).
A query to this oracle is a function $\psi : X \times \{-1,1\} \to \{-1,1\}$. The oracle may respond to the
query with any value $v$ satisfying $|\mathbf{E}_D[\psi(x, f(x))] - v| \le \tau$, where $\tau \in [0,1]$ is a real number called
the tolerance of the query. For convenience, we allow the query functions to be real-valued in the
range $[-1, 1]$. As observed by Aslam and Decatur [2], this extension is equivalent to the
original SQ model.
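
As a rough illustration of the oracle just defined, the following Python sketch simulates STAT($f, D$) on a small finite domain. The helper name `make_stat_oracle`, the dictionary representation of $D$, and the use of uniformly random noise are assumptions made for this example; the model itself allows any answer within the tolerance, including adversarially chosen ones.

```python
import itertools
import random

def make_stat_oracle(f, D, rng=None):
    """Return a function stat(query, tau) that simulates STAT(f, D).

    D maps each point x to its probability; query(x, label) takes values in [-1, 1].
    The answer is the exact expectation plus noise of magnitude at most tau.
    """
    rng = rng or random.Random(0)

    def stat(query, tau):
        exact = sum(p * query(x, f(x)) for x, p in D.items())
        return exact + rng.uniform(-tau, tau)

    return stat

# Illustrative target: parity of the first two bits under the uniform distribution on {0,1}^3.
D = {x: 1 / 8 for x in itertools.product((0, 1), repeat=3)}
f = lambda x: (-1) ** (x[0] + x[1])
stat = make_stat_oracle(f, D)

# A query that correlates the label with the first bit alone; its true value is 0.
first_bit = lambda x, label: label * (-1) ** x[0]
estimate = stat(first_bit, 0.05)
```

The learner only ever sees values such as `estimate`; it never observes an individual labeled example.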

An algorithm $\mathcal{A}$ is said to learn $C$ in time $t$ from statistical queries of tolerance $\tau$ if $\mathcal{A}$ PAC
learns $C$ using STAT($f, D$) in place of the example oracle. In addition, each query $\psi$ made by $\mathcal{A}$
has tolerance $\tau$ and can be evaluated in time $t$. The statistical query learning complexity of $C$ over
$D$ is the minimum number of queries of tolerance $\tau$ required to learn $C$ over $D$ to accuracy $\epsilon$ and is
denoted by SLC($C, D, \epsilon, \tau$).

The algorithm is said to (efficiently) SQ learn $C$ if $t$ is polynomial in $n$ and $1/\epsilon$, and $\tau$ is
lower-bounded by the inverse of a polynomial in $n$ and $1/\epsilon$.

The SQ learning model extends to the agnostic setting analogously. That is, random examples 
from $A$ are replaced by queries to the SQ oracle STAT($A$). For a query $\psi$ as above, STAT($A$) returns
a value $v$ satisfying $|\mathbf{E}_{\langle x,b \rangle \sim A}[\psi(x, b)] - v| \le \tau$. We denote the agnostic statistical query learning
complexity of $C$ over $D$ by ASLC($C, D, \epsilon, \tau$).

A correlational statistical query is a statistical query for a correlation of a function over X with 
the target [10]. Namely, the query function is $\psi(x, \ell) = \phi(x) \cdot \ell$ for a function $\phi \in \mathcal{F}_1^\infty$. We say that a
query is target-independent if $\psi(x, \ell) = \phi(x)$ for a function $\phi \in \mathcal{F}_1^\infty$, that is, if $\psi$ is a function of the
point $x$ alone. We will need the following simple fact by Bshouty and Feldman [10] to relate learning
by statistical queries to learning by CSQs.

Lemma 2.2 ([10]) For any function $\psi : X \times \{-1,1\} \to [-1,1]$, $\psi(x, \ell) = \phi_1(x) \cdot \ell + \phi_2(x)$ for
some $\phi_1, \phi_2 \in \mathcal{F}_1^\infty$. In particular, a statistical query $(\psi, \tau)$ with respect to any distribution $D$ can
be answered using a statistical query that is target-independent and a correlational statistical query,
each of tolerance $\tau/2$.
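
One standard way to obtain the decomposition in Lemma 2.2 is to take $\phi_1(x) = (\psi(x,1) - \psi(x,-1))/2$ and $\phi_2(x) = (\psi(x,1) + \psi(x,-1))/2$. The Python sketch below checks this identity on a few points; the helper name `decompose` and the sample query are assumptions made for this illustration.

```python
def decompose(psi):
    """Split psi(x, label) into phi1(x) * label + phi2(x) for label in {-1, 1}.

    phi1 is the correlational part and phi2 the target-independent part.
    Both are averages of values of psi, so they stay in [-1, 1].
    """
    phi1 = lambda x: (psi(x, 1) - psi(x, -1)) / 2
    phi2 = lambda x: (psi(x, 1) + psi(x, -1)) / 2
    return phi1, phi2

# Illustrative query with range [-1, 1] on points x in [-1, 1].
psi = lambda x, label: x * x if label == 1 else -x
phi1, phi2 = decompose(psi)
points = [-1.0, -0.5, 0.0, 0.5, 1.0]
```

With this split, $\mathbf{E}_D[\psi(x, f(x))] = \langle \phi_1, f \rangle_D + \mathbf{E}_D[\phi_2(x)]$, so estimating each term to within $\tau/2$ gives the original query to within $\tau$.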

2.4 (Weak) SQ Dimension

Blum et al. showed that concept classes weakly SQ learnable using only a polynomial number of
statistical queries of inverse polynomial tolerance are exactly the concept classes that have polynomial
statistical query dimension or SQ-DIM [8]. The dimension is based on the largest number of almost
orthogonal (using the $\langle \cdot, \cdot \rangle_D$ inner product) functions in the set.

Definition 2.3 ([8, 36]) For a concept class $C$ we say that SQ-DIM($C, D$) $= d$ if $d$ is the largest
value for which there exist $d$ functions $f_1, f_2, \ldots, f_d \in C$ such that for every $i \ne j$, $|\langle f_i, f_j \rangle_D| \le 1/d$.
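
The condition in Definition 2.3 is easy to test for a given list of functions on a finite domain. The Python sketch below checks that the 8 parity functions on $\{0,1\}^3$ are pairwise uncorrelated under the uniform distribution, so they witness a dimension of at least 8 for any class containing them. The helper name `is_sqdim_witness` is an assumption made for this example, and the check only verifies a lower bound; it does not compute SQ-DIM.

```python
import itertools

def inner(phi, psi, D):
    """<phi, psi>_D over a finite domain; D maps points to probabilities."""
    return sum(p * phi(x) * psi(x) for x, p in D.items())

def is_sqdim_witness(funcs, D):
    """True if the d given functions satisfy |<f_i, f_j>_D| <= 1/d for all i != j."""
    d = len(funcs)
    return all(abs(inner(fi, fj, D)) <= 1 / d
               for fi, fj in itertools.combinations(funcs, 2))

def parity(S):
    """Parity over the index set S, as a +/-1-valued function on bit tuples."""
    return lambda x: (-1) ** sum(x[i] for i in S)

n = 3
D = {x: 1 / 2 ** n for x in itertools.product((0, 1), repeat=n)}
parities = [parity(S) for r in range(n + 1)
            for S in itertools.combinations(range(n), r)]
```

Here `is_sqdim_witness(parities, D)` holds because distinct parities have inner product exactly 0 under the uniform distribution.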

Bshouty and Feldman gave an alternative way to characterize weak learning by statistical query 
algorithms that is based on the number of functions required to weakly approximate each function 
in the set [10].

Definition 2.4 For a concept class $C$ and $\gamma > 0$ we say that SQD($C, D, \gamma$) $= d$ if there exists a set
of $d$ functions $G \subset \mathcal{F}_1^\infty$ ³ such that for every $f \in C$, $|\langle f, g \rangle_D| \ge \gamma$ for some $g \in G$. In addition, no
value smaller than $d$ has this property.

Bshouty and Feldman show that a concept class $C$ is weakly SQ learnable over $D$ using a polynomial
number of queries if and only if SQD($C, D, 1/t(n)$) $= d(n)$ for some polynomials $d(\cdot)$ and
$t(\cdot)$ [10]. It is also possible to relate SQD and SQ-DIM more directly. It is well-known that the
maximal set of almost orthogonal functions in $C$ is also the approximating set for $C$. In other words,
SQD($C, D, 1/d$) $\le d$, where $d$ = SQ-DIM($C, D$). The connection in the other direction is implicit in
the work of Blum et al. [8]. Here we will use a stronger version given by Yang [36].

Lemma 2.5 ([36]) Let $C$ be a concept class and $D$ be a distribution over $X$. Then
SQD($C, D, d^{-1/3}$) $\ge d^{1/3}/2$, where $d$ = SQ-DIM($C, D$).

3 Strong SQ Dimension 

In this section we give a generalization of the weak statistical query dimension to strong learning. We 
first extend the approximation-based characterization of Bshouty and Feldman [10] and then obtain
an orthogonality-based characterization from it. 

3.1 Approximation-Based Characterization 

In order to define our strong statistical query dimension we first need to generalize the
approximation-based characterization of Bshouty and Feldman [10] to sets of real-valued functions rather than just
concept classes. To achieve this we simply note that the definition of SQD($C, D, \gamma$) does not use the
fact that functions in $C$ are Boolean and hence we can define SQD($F, D, \gamma$) for any set of real-valued
functions $F$ in exactly the same way. We now define the strong statistical query dimension of a class
of functions $C$.

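Definition 3.1 For a concept class $C$, distribution $D$ and $\epsilon, \gamma > 0$ we define

$$\mathrm{SQSD}(C, D, \epsilon, \gamma) = \sup_{\psi \in \mathcal{F}_1^\infty} \left\{ \mathrm{SQD}\big((C \setminus B_D(\mathrm{sign}(\psi), \epsilon)) - \psi,\; D,\; \gamma\big) \right\}.$$

In other words, we say that SQSD($C, D, \epsilon, \gamma$) $= d$ if for every $\psi \in \mathcal{F}_1^\infty$, there exists a set of $d$ functions
$G_\psi \subset \mathcal{F}_1^\infty$ such that for every $f \in C$, either

1. $\Pr_D[f(x) \ne \mathrm{sign}(\psi(x))] \le \epsilon$ or
2. there exists $g \in G_\psi$ such that $|\langle f - \psi, g \rangle_D| \ge \gamma$.

In addition, no value smaller than $d$ has this property.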

We now give a simple proof that SQSD($C, D, \epsilon, \gamma$) characterizes (within a polynomial) the number
of statistical queries required to learn $C$ over $D$ with accuracy $\epsilon$ and query tolerance $\gamma$.

Theorem 3.2 For every concept class $C$, distribution $D$ over $X$ and $\epsilon, \tau > 0$,

$$\mathrm{SLC}(C, D, \epsilon, \tau) \ge \mathrm{SQSD}(C, D, \epsilon + \tau, \tau) - 2.$$

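Proof: Let $\mathcal{A}$ be an SQ algorithm that learns $C$ over $D$ using $q$ = SLC($C, D, \epsilon, \tau$) queries of tolerance
$\tau$. According to Lemma 2.2 we can assume that $\mathcal{A}$ makes only correlational SQs of tolerance $\tau$ since
$\mathcal{A}$ can compute the values of the target-independent SQs exactly.

Now let $\psi \in \mathcal{F}_1^\infty$ be any function. The set $G_\psi$ is constructed as follows. Simulate algorithm $\mathcal{A}$
and for every correlational query $(\phi_i \cdot \ell, \tau)$ add $\phi_i$ to $G_\psi$ and respond to the query with the value
$\langle \psi, \phi_i \rangle_D = \mathbf{E}_D[\phi_i(x) \cdot \psi(x)]$. Continue the simulation until $\mathcal{A}$ outputs a hypothesis $h_\psi$. Add $\mathrm{sign}(\psi)$
and $h_\psi$ to $G_\psi$.

First, by the definition of $G_\psi$, $q \ge |G_\psi| - 2$. Now, let $f$ be any function in $C$. If there does
not exist $g \in G_\psi$ such that $|\langle f - \psi, g \rangle_D| \ge \tau$ then for every correlational query function $\phi_i \in G_\psi$,
$|\langle \psi, \phi_i \rangle_D - \langle f, \phi_i \rangle_D| < \tau$. This means that in our simulation, $\langle \psi, \phi_i \rangle_D$ is within $\tau$ of $\langle f, \phi_i \rangle_D$.
Therefore the answers provided by our simulator are valid for the execution of $\mathcal{A}$ when the target
function is $f$. That is, they could have been returned by STAT($f, D$) with tolerance $\tau$. Therefore, by
the definition of $\mathcal{A}$, the hypothesis $h_\psi$ satisfies $\langle f, h_\psi \rangle_D \ge 1 - 2\epsilon$. Both $\mathrm{sign}(\psi)$ and $h_\psi$ are in $G_\psi$
and therefore we also know that $|\langle f - \psi, \mathrm{sign}(\psi) \rangle_D| < \tau$ and $|\langle f - \psi, h_\psi \rangle_D| < \tau$. These conditions
imply that $\langle f, \mathrm{sign}(\psi) \rangle_D \ge \langle \psi, \mathrm{sign}(\psi) \rangle_D - \tau$ and $\langle \psi, h_\psi \rangle_D \ge \langle f, h_\psi \rangle_D - \tau$. In addition, for all
$\psi, h_\psi \in \mathcal{F}_1^\infty$, $\langle \psi, \mathrm{sign}(\psi) \rangle_D \ge \langle \psi, h_\psi \rangle_D$. By combining these inequalities we conclude that

$$\langle f, \mathrm{sign}(\psi) \rangle_D \ge \langle \psi, \mathrm{sign}(\psi) \rangle_D - \tau \ge \langle \psi, h_\psi \rangle_D - \tau \ge \langle f, h_\psi \rangle_D - 2\tau \ge 1 - 2\epsilon - 2\tau,$$

which is equivalent to $\Pr_D[f(x) \ne \mathrm{sign}(\psi(x))] \le \epsilon + \tau$. In other words, if there does not exist
$g \in G_\psi$ such that $|\langle f - \psi, g \rangle_D| \ge \tau$ then $f \in B_D(\mathrm{sign}(\psi), \epsilon + \tau)$, giving us the claimed inequality.
□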

Remark 3.3 If A is randomized then it can be converted to a non-uniform deterministic algorithm 
via a standard transformation (e.g. [10]). Therefore, Theorem 3.2 also applies to SQ learning by
randomized algorithms. 

We now establish the other direction of our characterization. 

Theorem 3.4 For every concept class $C$, distribution $D$ over $X$ and $\epsilon, \tau > 0$,

$$\mathrm{SLC}(C, D, \epsilon, \tau) \le \mathrm{SQSD}(C, D, \epsilon, 4\tau)/(3\tau^2).$$

Proof: Let $d$ = SQSD($C, D, \epsilon, 4\tau$). Our learning algorithm for $C$ builds an approximation to the
target function $f$ in steps. In each step we have a current hypothesis $\psi_i \in \mathcal{F}_1^\infty$. If $\mathrm{sign}(\psi_i)$ is not
$\epsilon$-close to $f$ then we find a function $g \in G_{\psi_i}$ such that $|\langle f - \psi_i, g \rangle_D| \ge \gamma$. Such $g$ can be viewed as a
vector "pointing" in the direction of $f$ from $\psi_i$. We therefore set $\psi'_{i+1} = \psi_i + \langle f - \psi_i, g \rangle_D \cdot g$. As we
will show, $\psi'_{i+1}$ is closer (in distance measured by $\|\cdot\|_D$) to $f$ than $\psi_i$. However, $\psi'_{i+1}$ is not necessarily
in $\mathcal{F}_1^\infty$. We define $\psi_{i+1}$ to be the projection of $\psi'_{i+1}$ onto $\mathcal{F}_1^\infty$. As we will show, this projection step
only decreases the distance to the target function. We will now provide the details of the proof.

Let $\psi_0 \equiv 0$. Given $\psi_i$ we define $\psi_{i+1}$ as follows. Let $G_{\psi_i}$ be the set of size at most $d$ that
correlates with every function in $(C \setminus B_D(\mathrm{sign}(\psi_i), \epsilon)) - \psi_i$ (as given by Definition 3.1). For every
$g \in G_{\psi_i}$ we make a query for $\langle f, g \rangle_D$ to STAT($f, D$) with tolerance $\tau$ and denote the answer by $v(g)$.
If there exists $g \in G_{\psi_i}$ such that $|v(g) - \langle \psi_i, g \rangle_D| \ge 3\tau$ then we set $g_i = g$, $\gamma_i = v(g_i) - \langle \psi_i, g_i \rangle_D$, and
$\psi'_{i+1} = \psi_i + \gamma_i \cdot g_i$. Otherwise the algorithm outputs $\mathrm{sign}(\psi_i)$. Note that if $\mathrm{sign}(\psi_i)$ is not $\epsilon$-close to
$f$ then there exists $g \in G_{\psi_i}$ such that $|\langle f - \psi_i, g \rangle_D| \ge 4\tau$ and, in particular, $|v(g) - \langle \psi_i, g \rangle_D| \ge 3\tau$.
We then set $\psi_{i+1}$ to be the projection of $\psi'_{i+1}$ onto $\mathcal{F}_1^\infty$:

$$\psi_{i+1}(x) = \begin{cases} \psi'_{i+1}(x) & \text{if } |\psi'_{i+1}(x)| \le 1, \\ \mathrm{sign}(\psi'_{i+1}(x)) & \text{otherwise.} \end{cases}$$

We then continue to the next iteration using $\psi_{i+1}$.

As we can see, $\mathrm{sign}(\psi_i)$ is only output when $\mathrm{sign}(\psi_i)$ is $\epsilon$-close to $f$. Therefore in order to prove
the desired bound on the number of queries it is sufficient to show that the algorithm will output
$\mathrm{sign}(\psi_i)$ after an appropriate number of iterations. This is established via the following claim.

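Claim 3.5 For every $i$, $\|f - \psi_i\|_D^2 \le 1 - 3 \cdot i \cdot \tau^2$.
Proof: First, $\|f - \psi_0\|_D^2 = \|f\|_D^2 = 1$. Next,

$$\|f - \psi'_{i+1}\|_D^2 = \|(f - \psi_i) - \gamma_i \cdot g_i\|_D^2 = \|f - \psi_i\|_D^2 + \gamma_i^2 \|g_i\|_D^2 - 2\gamma_i \langle f - \psi_i, g_i \rangle_D.$$

Therefore,

$$\|f - \psi_i\|_D^2 - \|f - \psi'_{i+1}\|_D^2 = 2\gamma_i \langle f - \psi_i, g_i \rangle_D - \gamma_i^2 \|g_i\|_D^2 \ge 2\gamma_i \langle f - \psi_i, g_i \rangle_D - \gamma_i^2$$
$$\stackrel{(*)}{=} 2|\gamma_i| \cdot |\langle f - \psi_i, g_i \rangle_D| - \gamma_i^2 \ge 2|\gamma_i|(|\gamma_i| - \tau) - \gamma_i^2 \ge \gamma_i^2/3 \ge 3\tau^2.$$

To obtain (*) we note that $|\gamma_i| \ge 3\tau$ and $|\langle f - \psi_i, g_i \rangle_D - \gamma_i| = |\langle f, g_i \rangle_D - v(g_i)| \le \tau$. Therefore the
sign of $\gamma_i$ is the same as the sign of $\langle f - \psi_i, g_i \rangle_D$ and $|\langle f - \psi_i, g_i \rangle_D| \ge |\gamma_i| - \tau \ge 2|\gamma_i|/3$.

We now claim that $\|f - \psi'_{i+1}\|_D^2 \ge \|f - \psi_{i+1}\|_D^2$. This follows easily from the definition of $\psi_{i+1}$.
If for a point $x$, $\psi_{i+1}(x) = \psi'_{i+1}(x)$ then clearly $f(x) - \psi'_{i+1}(x) = f(x) - \psi_{i+1}(x)$. Otherwise, if
$|\psi'_{i+1}(x)| > 1$ then $\psi_{i+1}(x) = \mathrm{sign}(\psi'_{i+1}(x))$ and for any value $f(x) \in \{-1,1\}$, $|f(x) - \psi'_{i+1}(x)| \ge
|f(x) - \psi_{i+1}(x)|$. This implies that $\mathbf{E}_D[(f - \psi'_{i+1})^2] \ge \mathbf{E}_D[(f - \psi_{i+1})^2]$.

We therefore obtain that for every $i$, $\|f - \psi_i\|_D^2 - \|f - \psi_{i+1}\|_D^2 \ge 3\tau^2$, giving us the claim. □ (Cl. 3.5)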

Claim 3.5 implies that the algorithm makes at most $1/(3\tau^2)$ iterations. In each iteration at most
$d$ queries are made and therefore the algorithm uses at most $d/(3\tau^2)$ queries of tolerance $\tau$. □ (Th. 3.4)
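
The iterative procedure in this proof can be sketched in Python on a small finite domain. Everything named here is an assumption made for illustration: functions are stored as dictionaries over the points, `approx_set` stands in for the map $\psi \mapsto G_\psi$, and the demo uses the 8 parities on $\{0,1\}^3$ as $G_\psi$ for every $\psi$. That choice is valid for $\epsilon = 0.1$ and $\tau = 0.03$ because, by Parseval's identity, a hypothesis whose sign disagrees with $f$ on at least one of the 8 points has correlation at least $1/8 \ge 4\tau$ with some parity.

```python
import itertools
import random

def learn_by_projected_steps(stat, approx_set, inner, points, tau, max_iters):
    """Sketch of the update-and-project loop; returns sign(psi_i) as a dict."""
    psi = {x: 0.0 for x in points}                 # psi_0 = 0
    for _ in range(max_iters):
        step = None
        for g in approx_set(psi):
            v = stat(g, tau)                       # estimate of <f, g>_D within tau
            gamma = v - inner(psi, g)
            if abs(gamma) >= 3 * tau:              # g noticeably separates f from psi
                step = (gamma, g)
                break
        if step is None:
            break                                  # no such g: output sign(psi_i)
        gamma, g = step
        # Move towards f along g, then project each value back onto [-1, 1].
        psi = {x: max(-1.0, min(1.0, psi[x] + gamma * g[x])) for x in points}
    return {x: (1 if psi[x] >= 0 else -1) for x in points}

# Demo setup (hypothetical): uniform distribution on {0,1}^3, target = majority.
points = list(itertools.product((0, 1), repeat=3))
D = {x: 1 / 8 for x in points}

def inner(a, b):
    return sum(D[x] * a[x] * b[x] for x in points)

parities = [{x: (-1) ** sum(x[i] for i in S) for x in points}
            for r in range(4) for S in itertools.combinations(range(3), r)]
f = {x: (1 if sum(x) >= 2 else -1) for x in points}
rng = random.Random(0)

def stat(g, tau):
    """Simulated correlational SQ oracle: <f, g>_D plus noise of size at most tau."""
    return inner(f, g) + rng.uniform(-tau, tau)

h = learn_by_projected_steps(stat, lambda psi: parities, inner, points,
                             tau=0.03, max_iters=400)
```

The iteration cap of 400 exceeds the bound $1/(3\tau^2) \approx 370$ from Claim 3.5, so in this setup the loop stops because no function passes the $3\tau$ test, and the returned `h` agrees with `f` on every point.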

An important property of the proofs of Theorems 3.2 and 3.4 is that they give a simple and efficient
way to convert a learning algorithm for $C$ into an algorithm that, given access to target-independent
statistical queries with respect to $D$, builds an approximating set for every $\psi$, and vice versa.
The access to target-independent statistical queries with respect to $D$ can be replaced by a circuit
that provides random samples from $D$ if $D$ is efficiently samplable, or by a fixed polynomial-size random
(unlabeled) sample from $D$ (in this case the resulting algorithm is non-uniform). For convenience we
refer to either of these options as access to $D$.

Theorem 3.6 Let $C$ be a concept class and $D$ be a distribution over $X$. $C$ is efficiently SQ learnable
if and only if there exists an algorithm $B$ that for every $\epsilon > 0$, given $\epsilon$, access to $D$ and a circuit for
$\psi \in \mathcal{F}_1^\infty$ can produce a set of functions $G_\psi$ such that

1. $G_\psi$ satisfies the conditions of Definition 3.1 for some polynomial $d$ and inverse-polynomial $\gamma$
(in $n$, $1/\epsilon$);

2. the circuit size of every function in $G_\psi$ is polynomial in $n$ and $1/\epsilon$;

3. the running time of $B$ is polynomial in $n$, $1/\epsilon$ and the circuit size of $\psi$.

Proof: The proof of Theorem 3.2 gives a way to construct the set $G_\psi$ by simulating $A$ while using $\psi$
in place of the target function $f$. This construction of $G_\psi$ would be efficient provided the exact values
of $\mathbf{E}_D[\phi_i(x) \cdot \psi(x)]$ and the exact values of target-independent SQs in the simulation of algorithm $A$
were available. However it is easy to see that the exact values are not necessary and can be replaced
by estimates within $\tau/2$. Such estimates can be easily obtained given access to $D$.

Similarly, in the proof of Theorem 3.4 the iterative procedure would yield an efficient SQ learning
algorithm for $C$ provided the exact values of $\langle \psi_i, g\rangle_D$ were available. It is again easy to see that in
place of exact values estimates within $\tau/2$ can be used if the accuracy of statistical queries is also
increased to $\tau/2$. This implies that if there exists an efficient algorithm that, given a polynomial-size
circuit for $\psi \in \mathcal{F}_1^\infty$ and access to $D$, generates $G_\psi$, then $C$ is efficiently SQ learnable over $D$. □

3.2 Orthogonality-Based Characterization 

In order to simplify the application of our characterization we show that, with only a polynomial 
loss in the bounds one can obtain an orthogonality-based version of SQSD. Specifically, we convert 
the bound on the number of functions required to weakly approximate every function in some set of 
functions F to a bound on the maximum number of almost uncorrelated functions in F. 

In Lemma 3.8 we generalize Yang's conversion (Lemma 2.5) to sets of arbitrary real-valued functions.
But first we need to appropriately extend the definition of SQ-DIM to sets of arbitrary
real-valued functions. For this purpose we simply use Definition 2.3 applied to sets of real-valued
functions.

Definition 3.7 For a set of real-valued functions $F$ we say that SQ-DIM$(F, D) = d$ if $d$ is the largest
value for which there exist $d$ functions $f_1, f_2, \ldots, f_d \in F$ such that for every $i \neq j$, $|\langle f_i, f_j\rangle_D| \leq 1/d$.

Lemma 3.8 Let $D$ be a distribution and $F$ be a set of functions such that for every $\phi \in F$, $m \leq \|\phi\|_D \leq M$
for some $M \geq 1 \geq m$. Then SQD$(F, D, M(dm^2)^{-1/3}) \geq (dm^2)^{1/3}/2$, where $d$ = SQ-DIM$(F, D)$.

Proof: Yang shows that our claim is correct if for every $\phi \in F$, $\|\phi\|_D = 1$ [123 Cor. 1]. While his
claim (Lemma 2.5) is only for Boolean functions, the only property of Boolean functions used in his
proof is their $\|\cdot\|_D$-norm being equal to 1. We reduce our general case to this special case by defining
$F' = \{f/\|f\|_D \mid f \in F\}$. We claim that SQ-DIM$(F', D) \geq d \cdot m^2$. It is easy to see this since if for
$f_1, f_2 \in F$, $|\langle f_1, f_2\rangle_D| \leq 1/d$ then

$$\left|\left\langle \frac{f_1}{\|f_1\|_D}, \frac{f_2}{\|f_2\|_D} \right\rangle_D\right| \leq \frac{1}{d \cdot m^2} .$$

This means that the existence of a set of $d$ functions in $F$ with correlations of at most $1/d$ would
imply the existence of $d \geq d \cdot m^2$ functions in $F'$ with mutual correlations of at most $1/(dm^2)$.

We apply Yang's result to $F'$ and obtain SQD$(F', D, (dm^2)^{-1/3}) \geq (dm^2)^{1/3}/2$. This implies that
SQD$(F, D, M(dm^2)^{-1/3}) \geq (dm^2)^{1/3}/2$. To see this assume for the sake of contradiction that there
exists a set $G$ of size less than $(dm^2)^{1/3}/2$ such that for every $f \in F$, $|\langle f, g\rangle_D| \geq M(dm^2)^{-1/3}$ for
some $g \in G$. Then for every $f' = f/\|f\|_D \in F'$, $|\langle f', g\rangle_D| = |\langle f, g\rangle_D|/\|f\|_D \geq M(dm^2)^{-1/3}/\|f\|_D \geq (dm^2)^{-1/3}$.
This would violate the bound on SQD$(F', D, (dm^2)^{-1/3})$ that we have obtained. □

We define SQ-SDIM$(C, D, \epsilon)$ to be the generalization of SQ-DIM to $\epsilon$-accurate learning as follows.

Definition 3.9 SQ-SDIM$(C, D, \epsilon) = \sup_{\psi \in \mathcal{F}_1^\infty}$ SQ-DIM$((C \setminus B_D(\mathrm{sign}(\psi), \epsilon)) - \psi, D)$.

We are now ready to relate SQSD and SQ-SDIM.

Theorem 3.10 Let $C$ be a concept class, $D$ be a distribution over $X$, $\epsilon > 0$ and $d$ = SQ-SDIM$(C, D, \epsilon)$.
Then SQSD$(C, D, \epsilon, 1/(2d)) \leq d$ and SQSD$(C, D, \epsilon, 2(\epsilon d)^{-1/3}) \geq (\epsilon d)^{1/3}/2$.

Proof: Let $\psi \in \mathcal{F}_1^\infty$ be any function, let $F_\psi = (C \setminus B_D(\mathrm{sign}(\psi), \epsilon)) - \psi$ and let $d'$ = SQ-DIM$(F_\psi, D) \leq$
SQ-SDIM$(C, D, \epsilon) = d$.

For the first part of the claim we use a minor modification of the standard relation between SQD
and SQ-SDIM (see Section 2). Let $F_1 = \{f_1, f_2, \ldots, f_{d'}\} \subseteq F_\psi$ be the largest set of functions such
that for every $i \neq j$, $|\langle f_i, f_j\rangle_D| \leq 1/d'$. The maximality of $d'$ implies that for every $f \in F_\psi$, there
exists $f_i \in F_1$ such that $|\langle f_i, f\rangle_D| > 1/d'$. Thus $F_1$ is an approximating set for $F_\psi$. The only minor
problem is that we need an approximating set of functions in $\mathcal{F}_1^\infty$. The range of each function in
$F_\psi$ is $[-2, 2]$ and therefore to obtain an approximating set in $\mathcal{F}_1^\infty$ we simply scale $F_1$ by $1/2$. By
taking $G_\psi = \{f/2 \mid f \in F_1\}$ we obtain that SQD$(F_\psi, D, 1/(2d')) \leq d'$. This holds for every $\psi \in \mathcal{F}_1^\infty$
and therefore SQSD$(C, D, \epsilon, 1/(2d)) \leq d$.

For the second part of the claim we first observe that for every $f \in F_\psi$, $f = c - \psi$ for
$c \in C$ and $\Pr_D[c \neq \mathrm{sign}(\psi)] > \epsilon$. This implies that $\Pr_D[|f| \geq 1] > \epsilon$ and hence $2 \geq \|f\|_D > \sqrt{\epsilon}$.
We now use Lemma 3.8 to obtain SQD$(F_\psi, D, 2(\epsilon d')^{-1/3}) \geq (\epsilon d')^{1/3}/2$. This implies that
SQSD$(C, D, \epsilon, 2(\epsilon d)^{-1/3}) \geq (\epsilon d)^{1/3}/2$. □

We can combine Theorem 3.10 with the approximation-based characterization (Th. 3.2 and 3.4)
to obtain a characterization of strong SQ learnability based on SQ-SDIM.

Theorem 3.11 Let $C$ be a concept class, $D$ be a distribution over $X$ and $\epsilon > 0$. If there exists a
polynomial $p(\cdot,\cdot)$ such that $C$ is SQ learnable over $D$ to accuracy $\epsilon$ from $p(n, 1/\epsilon)$ queries of tolerance
$1/p(n, 1/\epsilon)$ then SQ-SDIM$(C, D, \epsilon + 1/p(n, 1/\epsilon)) \leq p'(n, 1/\epsilon)$ for some polynomial $p'(\cdot,\cdot)$. Further, if
SQ-SDIM$(C, D, \epsilon) \leq p(n, 1/\epsilon)$ then $C$ is SQ learnable over $D$ to accuracy $\epsilon$ from $p'(n, 1/\epsilon)$ queries
of tolerance $1/p'(n, 1/\epsilon)$ for some polynomial $p'(\cdot,\cdot)$.

4 SQ Dimension for Agnostic Learning 

In this section we extend the statistical query dimension characterization to agnostic learning. Our
characterization is based on the well-known observation that agnostic learning of a concept class
$C$ requires (a weak form of) learning of the set of functions $F$ in which every function is weakly
approximated by some function in $C$ [25]. For example, agnostic learning of Boolean conjunctions
implies weak learning of DNF expressions. We formalize this by defining an $L_1^D$ $\epsilon$-ball around a
real-valued function $\phi$ over $X$ as $B_1^D(\phi, \epsilon) = \{\psi \in \mathcal{F}_1^\infty \mid L_1^D(\psi, \phi) \leq \epsilon\}$ and around a set of functions
$C$ as $B_1^D(C, \epsilon) = \cup_{f \in C} B_1^D(f, \epsilon)$. In $(\alpha, \beta)$-agnostic learning of a function class $C$ over the marginal
distribution $D$, the learning algorithm only needs to learn when the distribution over examples
$A = (D, \phi)$ satisfies $\Delta(A, C) \leq \alpha$. In other words, it needs to learn for any $A = (D, \phi)$ such that there exists $c \in C$
for which $\Delta(A, c) = L_1^D(\phi, c)/2 \leq \alpha$. Therefore $(\alpha, \beta)$-agnostic learning with respect to distribution $D$
can be seen as learning of the set of distributions $\mathcal{D} = \{(D, \phi) \mid \phi \in B_1^D(C, 2\alpha)\}$ with error of at most
$\beta$. This observation allows us to apply the characterizations from Section 3 after the straightforward
generalization of SQSD and SQ-SDIM to general sets of real-valued functions. Namely, for a set of
real-valued functions $F$, we define

$$\mathrm{SQSD}(F, D, \epsilon, \gamma) = \sup_{\psi \in \mathcal{F}_1^\infty} \left\{ \mathrm{SQD}\left((F \setminus B_1^D(\mathrm{sign}(\psi), 2\epsilon)) - \psi, D, \gamma\right) \right\} .$$

SQ-SDIM$(F, D, \epsilon)$ is defined analogously. It is easy to see that when $F$ contains only $\{-1, 1\}$
functions these generalized definitions are identical to Definitions 3.1 and 3.9.

We can now characterize the query complexity of $(\alpha, \beta)$-agnostic SQ learning using
SQSD$(B_1^D(C, 2\alpha), D, \beta, \gamma)$ in exactly the same way as SLC is characterized using SQSD$(C, D, \epsilon, \gamma)$. Formally, we
obtain the following theorem.

Theorem 4.1 Let $C$ be a concept class, $D$ be a distribution over $X$ and $0 < \alpha < \beta < 1/2$. Let $d$
be the smallest number of SQs of tolerance $\tau$ sufficient to $(\alpha, \beta)$-agnostically learn $C$. Then

1. $d \geq \mathrm{SQSD}(B_1^D(C, 2\alpha), D, \beta + \tau, \tau) - 2$,

2. $d \leq \mathrm{SQSD}(B_1^D(C, 2\alpha), D, \beta, 4\tau)/(3\tau^2)$.

To prove Theorem 4.1 we only need to observe that the proofs of Theorems 3.2 and 3.4 do not assume
that the concept class $C$ contains only Boolean functions and hold for any class of functions contained
in $\mathcal{F}_1^\infty$. To obtain a characterization of $(\alpha, \beta)$-agnostic SQ learning using SQ-SDIM we would also
need to extend Theorem 3.10 to general sets of functions in $\mathcal{F}_1^\infty$.

Theorem 4.2 Let $F \subseteq \mathcal{F}_1^\infty$ be a set of functions, $D$ be a distribution over $X$, $\epsilon > 0$ and $d$ =
SQ-SDIM$(F, D, \epsilon)$. Then SQSD$(F, D, \epsilon, 1/(2d)) \leq d$ and SQSD$(F, D, \epsilon, 2d^{-1/5}) \geq (\epsilon d)^{1/3}/2$.

Proof: We first observe that the first part of the proof of Theorem 3.10 can be used verbatim for
more general sets of functions. However the proof of the second part relies on a lower bound of
$\sqrt{\epsilon}$ on the $\|\cdot\|_D$-norm of every function in $F_\psi = (F \setminus B_D(\mathrm{sign}(\psi), \epsilon)) - \psi$. In place of this lower
bound we observe that if there exists a function $f \in F_\psi$ such that $\|f\|_D < 2d^{-1/5}$ then there does
not exist $g \in \mathcal{F}_1^\infty$ such that $|\langle f, g\rangle_D| \geq 2d^{-1/5}$ and, in particular, SQD$(F_\psi, D, 2d^{-1/5}) = \infty$. This
would imply the claim. Otherwise (when $\|f\|_D \geq 2d^{-1/5}$ for all $f \in F_\psi$), we can apply Lemma 3.8
(with $m = 2d^{-1/5}$) to obtain SQD$(F_\psi, D, 2^{1/3} d^{-1/5}) \geq (\epsilon d)^{1/3}/2$. In either case we obtain that
SQSD$(F, D, \epsilon, 2d^{-1/5}) \geq (\epsilon d)^{1/3}/2$. □
While we can now use SQSD or SQ-SDIM to characterize SQ learnability in the basic agnostic
model, a simpler approach to characterization is suggested by recent distribution-specific agnostic
boosting algorithms [17, 21]. Formally, a weak agnostic learning algorithm is an algorithm that
can recover at least a polynomial fraction of the advantage over random guessing of the best
approximating function in $C$. Specifically, on a distribution $A = (D, \phi)$ it produces a hypothesis $h$
such that $\langle h, \phi\rangle_D \geq p(1/n, 1 - 2\Delta(A, C))$ for some polynomial $p(\cdot,\cdot)$. Distribution-specific agnostic
boosting algorithms of Kalai and Kanade [21] and Feldman [17] imply the equivalence of weak and
strong distribution-specific agnostic learning.

Theorem 4.3 ([17, 21]) Let $C$ be a concept class and $D$ be a distribution over $X$. If $C$ is efficiently
weakly agnostically learnable over $D$ then $C$ is agnostically learnable over $D$.

This result is proved only for example-based agnostic learning but, as with other boosting
algorithms, it can be easily translated to the SQ model (cf. [1]). Given Theorem 4.3 we can use the
known characterizations of weak learning together with our simple observation to characterize
(strong) agnostic SQ learning using either SQD or SQ-DIM.

Theorem 4.4 Let $C$ be a concept class and $D$ be a distribution over $X$. There exists a polynomial
$p(\cdot,\cdot)$ such that ASLC$(C, D, \epsilon, 1/p(n, 1/\epsilon)) \leq p(n, 1/\epsilon)$ if and only if there exists a
polynomial $p'(\cdot,\cdot)$ such that for every $1 > \Gamma > 0$, SQD$(B_1^D(C, 1 - \Gamma), D, 1/p'(n, 1/\Gamma)) \leq p'(n, 1/\Gamma)$.

Proof: The proof is essentially the same as the characterization of weak learning by Bshouty and
Feldman [10]. We review it briefly for completeness. Given $\Gamma > 0$ and an agnostic learning algorithm
$A$ for $C$, we simulate $A$ with $\epsilon = \Gamma/4$ as in the proof of Theorem 3.2 for $\psi \equiv 0$. Let $G$ be the set
containing the correlational queries obtained from $A$ and the final hypothesis. By the same analysis
as in the proof of Theorem 3.2 the size of $G$ is upper-bounded by a polynomial in $n$ and $1/\epsilon = 4/\Gamma$.
Further, for every $\phi \in B_1^D(C, 1 - \Gamma)$, there exists $g \in G$ such that $|\langle g, \phi\rangle_D| \geq \min\{\tau, \Gamma - 2\epsilon\} = \min\{\tau, \Gamma/2\}$.
The tolerance of the learning algorithm is lower bounded by the inverse of a polynomial
(in $n$ and $1/\Gamma$) and therefore we obtain the first direction of the claim.

If for every $\Gamma > 0$, SQD$(B_1^D(C, 1 - \Gamma), D, 1/p'(n, 1/\Gamma)) \leq p'(n, 1/\Gamma)$ then $C$ can be weakly
agnostically SQ learned by the following algorithm. First, ask the query $g \cdot b$ with tolerance $1/(3p'(n, 1/\Gamma))$
for each function $g$ in the approximating set $G$. Let $v(g)$ denote the answer to the query for $g$. For
a distribution $A = (D, \phi)$, $\mathbf{E}_A[g(x) \cdot b] = \langle g, \phi\rangle_D$ and therefore $|v(g) - \langle g, \phi\rangle_D| \leq 1/(3p'(n, 1/\Gamma))$.
By choosing $g' = \mathrm{argmax}_{g \in G}\{|v(g)|\}$ we are guaranteed that $|\langle g', \phi\rangle_D| \geq 1/(3p'(n, 1/\Gamma))$. Therefore
$\mathrm{sign}(v(g')) \cdot g'$ is a weak hypothesis for $f$. Finally, we can appeal to Theorem 4.3 to convert this
weak agnostic learning algorithm to a strong agnostic learning algorithm for $C$ over $D$. □

As before, we can now obtain an SQ-DIM-based characterization from the SQD-based one. 

Theorem 4.5 Let $C$ be a concept class and $D$ be a distribution over $X$. There exists a polynomial
$p(\cdot,\cdot)$ such that ASLC$(C, D, \epsilon, 1/p(n, 1/\epsilon)) \leq p(n, 1/\epsilon)$ if and only if there exists a
polynomial $p'(\cdot,\cdot)$ such that for every $1 > \Gamma > 0$, SQ-DIM$(B_1^D(C, 1 - \Gamma), D) \leq p'(n, 1/\Gamma)$.

Proof: Let $d$ = SQ-DIM$(B_1^D(C, 1 - \Gamma), D)$. Every function $f \in B_1^D(C, 1 - \Gamma)$ satisfies $\|f\|_D \geq \mathbf{E}_D[|f|] \geq \Gamma$.
Therefore, Lemma 3.8 implies that SQD$(B_1^D(C, 1 - \Gamma), D, (\Gamma^2 d)^{-1/3}) \geq (\Gamma^2 d)^{1/3}/2$.
This implies that $d \leq p_1(\mathrm{SQD}(B_1^D(C, 1 - \Gamma), D, 1/p_2(n, 1/\Gamma)), 1/\Gamma)$ for some polynomials $p_1(\cdot,\cdot)$
and $p_2(\cdot,\cdot)$. As in the case of concept classes, it follows immediately from the definition that $d \geq$
SQD$(B_1^D(C, 1 - \Gamma), D, 1/d)$. These bounds together with Theorem 4.4 imply the claim. □

We now give a simple example of the use of this characterization. For $X = \{0, 1\}^n$, let $U$
denote the uniform distribution over $\{0, 1\}^n$ and let $C_{n,k}$ denote the concept class of all monotone
conjunctions of at most $k$ Boolean variables.

Theorem 4.6 For every $k = \omega(1)$, the concept class $C_{n,k}$ is not efficiently agnostically SQ learnable
over the uniform distribution $U$.

Proof: Let $\chi_T$ denote the parity function of the variables with indices in $T \subseteq [n]$. Let $c_T$ denote
the monotone conjunction of the same set of variables. $\Pr_U[\chi_T(x) \neq c_T(x)] = 1/2 - 2^{-|T|}$
and therefore $L_1^U(\chi_T, c_T) = 1 - 2^{-|T|+1}$. In particular, for $P_{n,k} = \{\chi_T \mid |T| \leq k\}$, we get
$P_{n,k} \subseteq B_1^U(C_{n,k}, 1 - 2^{-k+1})$. For any two distinct parity functions $\chi_S$ and $\chi_T$, $\langle \chi_S, \chi_T\rangle_U = 0$
and therefore SQ-DIM$(B_1^U(C_{n,k}, 1 - 2^{-k+1}), U) \geq |P_{n,k}| = \sum_{i \leq k} \binom{n}{i}$. By choosing $\Gamma = 1/n$ we obtain
that SQ-DIM$(B_1^U(C_{n,k}, 1 - \Gamma), U) = n^{\omega(1)}$. Theorem 4.5 now implies the claim. □
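The two facts the proof relies on, the disagreement probability between a parity and the conjunction on the same variables and the orthogonality of distinct parities, can be checked by brute force for small $n$. The sketch below is illustrative only and fixes an encoding that the text does not spell out: inputs are in $\{0,1\}^n$, the conjunction outputs $1$ when all its variables equal 1 and $-1$ otherwise, and the parity is $\prod_{i \in T}(2x_i - 1)$. Only non-empty $T$ are enumerated.

```python
import itertools
import math

n, k = 4, 3
points = list(itertools.product([0, 1], repeat=n))


def parity(T, x):
    """Parity chi_T in {-1, 1} under the assumed encoding x_i -> 2*x_i - 1."""
    return math.prod(2 * x[i] - 1 for i in T)


def conj(T, x):
    """Monotone conjunction c_T: 1 if every variable in T is 1, else -1."""
    return 1 if all(x[i] == 1 for i in T) else -1


def disagreement(T):
    """Pr_U[chi_T(x) != c_T(x)] under the uniform distribution."""
    return sum(parity(T, x) != conj(T, x) for x in points) / len(points)


def correlation(S, T):
    """<chi_S, chi_T>_U under the uniform distribution."""
    return sum(parity(S, x) * parity(T, x) for x in points) / len(points)


# All non-empty index sets of size at most k.
subsets = [T for r in range(1, k + 1) for T in itertools.combinations(range(n), r)]

for T in subsets:
    print(T, disagreement(T), 0.5 - 2.0 ** (-len(T)))
```

Since the parities are pairwise uncorrelated, every family of them is a valid witness in the definition of SQ-DIM, which is what drives the lower bound.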

5 Applications to Evolvability 

In this section we use the characterization of SQ learnability and the analysis in the proof of Theorem
3.4 to derive a new type of evolutionary algorithm in Valiant's framework of evolvability [35].

5.1 Overview of the Model 

We start by presenting a brief overview of the model. For a detailed description and the intuition behind
the various choices made in the model the reader is referred to [35, 16]. The goal of the model is to
specify how organisms can acquire complex mechanisms via a resource-efficient process based on
random mutations and guided by fitness-based selection. The mechanisms are described in terms
of the multi-argument functions they implement. The fitness of such a mechanism is measured by
evaluating the correlation of the mechanism with some "ideal" behavior function. The value of the
"ideal" function on some input describes the most beneficial behavior for the condition represented
by the input. The evaluation of the correlation with the "ideal" function is derived by evaluating the
function on a moderate number of inputs drawn from a probability distribution over the conditions
that arise. These evaluations correspond to the experiences of one or more organisms that embody
the mechanism. A specific "ideal" function and a distribution over the domain of inputs effectively
define a fitness landscape over all functions.

Random variation is modeled by the existence of an explicit algorithm that acts on some fixed 
representation of mechanisms and for each representation of a mechanism produces representations 
of mutated versions of the mechanism. The model essentially does not place any restrictions on the 
mutation algorithm other than it being efficiently implementable. Selection is modeled by an explicit 
rule that determines the probabilities with which each of the mutations of a mechanism will be chosen 
to "survive" based on the fitness of all the mutations of the mechanism and the probabilities with 
which each of the mutations is produced by the mutation algorithm. 

As can be seen from the above description, a fitness landscape (given by a specific "ideal" function 
and a distribution over the domain), a mutation algorithm, and a selection rule jointly determine 
how each step of an evolutionary process is performed. A class of functions C is considered evolvable 
in selection rule Sel with respect to a distribution D over the domain if there exist a representation 
of mechanisms $R$ and a mutation algorithm $M$ such that for every "ideal" function $f \in C$, a sequence
of evolutionary steps starting from any representation in $R$ and performed according to $f$, $D$, $M$ and
selection rule Sel converges in a polynomial number of steps to $f$. The convergence is defined as
achieving fitness (which is the correlation with $f$ over $D$) of at least $1 - \epsilon$ for some $\epsilon > 0$ referred
to as the accuracy parameter. This process is essentially PAC learning of C over distribution D 
with the selection rule (rather than explicit examples) providing the only target-specific feedback. 
An evolvable class of functions C represents the complexity of structures that can evolve in a single 
phase of evolution driven by a single "ideal" function. We now define the model formally using the 
notation from [16] . 

5.2 Definition of Evolvability 

The description of a mutation algorithm $A$ consists of the definition of the representation class $R$ of
possibly randomized hypotheses in $\mathcal{F}_1^\infty$ and the description of a polynomial time algorithm that for
every $r \in R$ and $\epsilon > 0$ outputs a random mutation of $r$.

Definition 5.1 A mutation algorithm $A$ is defined by a pair $(R, M)$ where

• $R$ is a representation class of functions over $X$ with range in $[-1, 1]$.

• $M$ is a randomized polynomial time Turing machine that, given $r \in R$ and $1/\epsilon$ as input, outputs
a representation $r_1 \in R$ with probability $\Pr_A(r, r_1)$. The set of representations that can be output
by $M(r, \epsilon)$ is referred to as the neighborhood of $r$ for $\epsilon$ and denoted by $\mathrm{Neigh}_A(r, \epsilon)$.

A loss function $L$ is a non-negative mapping $L : Y \times Y \to \mathbb{R}^+$. $L(y, y')$ measures the "distance"
between the desired value $y$ and the predicted value $y'$. We will discuss the linear loss $L_1(y, y') = |y' - y|$
and the quadratic loss $L_Q(y, y') = (y' - y)^2$ functions. For a function $\phi \in \mathcal{F}_1^\infty$ its fitness (also referred
to as performance in earlier work) relative to loss function $L$, distribution $D$ over the domain and
target function $f$ is defined as

$$\mathrm{LPerf}_f(\phi, D) = 1 - 2 \cdot \mathbf{E}_D[L(f(x), \phi(x))]/L(-1, 1) .$$

For an integer $s$, functions $\phi, f \in \mathcal{F}_1^\infty$ over $X$, distribution $D$ over $X$ and loss function $L$, the empirical
fitness $\mathrm{LPerf}_f(\phi, D, s)$ of $\phi$ is a random variable that equals $1 - \frac{2}{s \cdot L(-1,1)} \sum_{i \in [s]} L(f(z_i), \phi(z_i))$ for
$z_1, z_2, \ldots, z_s \in X$ chosen randomly and independently according to $D$.
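To make the two fitness notions concrete, here is a minimal sketch for a finite domain. The function names (`lperf`, `empirical_lperf`) and the representation of $D$ as a list of points with weights are assumptions made for illustration; they are not notation from the paper. The empirical version averages the loss over $s$ independent draws from $D$.

```python
import random


def quadratic_loss(y, y_pred):
    """L_Q(y, y') = (y' - y)^2."""
    return (y_pred - y) ** 2


def linear_loss(y, y_pred):
    """L_1(y, y') = |y' - y|."""
    return abs(y_pred - y)


def lperf(f, phi, points, weights, loss):
    """Exact fitness 1 - 2 * E_D[L(f(x), phi(x))] / L(-1, 1)."""
    expected = sum(w * loss(f(x), phi(x)) for x, w in zip(points, weights))
    return 1 - 2 * expected / loss(-1, 1)


def empirical_lperf(f, phi, points, weights, loss, s, rng):
    """Empirical fitness computed from s independent draws from D."""
    sample = rng.choices(points, weights=weights, k=s)
    total = sum(loss(f(z), phi(z)) for z in sample)
    return 1 - 2 * total / (s * loss(-1, 1))


# Toy domain with two points; the target is the identity on {-1, 1}.
points, weights = [-1, 1], [0.5, 0.5]
target = lambda x: x
zero = lambda x: 0
negated = lambda x: -x

print(lperf(target, target, points, weights, quadratic_loss))   # perfect hypothesis
print(lperf(target, negated, points, weights, quadratic_loss))  # worst hypothesis
print(lperf(target, zero, points, weights, quadratic_loss))     # constant 0, quadratic loss
print(lperf(target, zero, points, weights, linear_loss))        # constant 0, linear loss
```

The constant-zero hypothesis shows how the choice of loss matters: under linear loss it is no better than random guessing, while quadratic loss rewards it for being halfway to the target everywhere.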
A number of natural ways of modeling selection were discussed in prior work [35, 16]. For
concreteness here we use the selection rule used in Valiant's main definition in a slightly generalized
version from [16]. In selection rule $\mathrm{SelNB}[L, t, p, s]$, $p$ candidate mutations are sampled using the
mutation algorithm. Then beneficial and neutral mutations are defined on the basis of their empirical
fitness LPerf in $s$ experiments (or examples) using tolerance $t$. If some beneficial mutations are
available one is chosen randomly according to their relative frequencies in the candidate pool. If
none is available then one of the neutral mutations is output randomly according to their relative
frequencies. If neither neutral nor beneficial mutations are available, $\bot$ is output to mean that no
mutation "survived".

Definition 5.2 For a loss function $L$, tolerance $t$, candidate pool size $p$, sample size $s$, selection
rule $\mathrm{SelNB}[L, t, p, s]$ is an algorithm that for any function $f$, distribution $D$, mutation algorithm $A = (R, M)$,
a representation $r \in R$, accuracy $\epsilon$, $\mathrm{SelNB}[L, t, p, s](f, D, A, r)$ outputs a random variable
that takes a value $r_1$ determined as follows. First run $M(r, \epsilon)$ $p$ times and let $Z$ be the set of
representations obtained. For $r' \in Z$, let $\Pr_Z(r')$ be the relative frequency with which $r'$ was generated
among the $p$ observed representations. For each $r' \in Z \cup \{r\}$, compute an empirical value of fitness
$v(r') = \mathrm{LPerf}_f(r', D, s)$. Let $\mathrm{Bene}(Z) = \{r' \mid v(r') \geq v(r) + t\}$ and $\mathrm{Neut}(Z) = \{r' \mid |v(r') - v(r)| < t\}$.
Then

(i) if $\mathrm{Bene}(Z) \neq \emptyset$ then output $r_1 \in \mathrm{Bene}(Z)$ with probability $\Pr_Z(r_1)/\sum_{r' \in \mathrm{Bene}(Z)} \Pr_Z(r')$;

(ii) if $\mathrm{Bene}(Z) = \emptyset$ and $\mathrm{Neut}(Z) \neq \emptyset$ then output $r_1 \in \mathrm{Neut}(Z)$ with probability $\Pr_Z(r_1)/\sum_{r' \in \mathrm{Neut}(Z)} \Pr_Z(r')$;

(iii) if $\mathrm{Neut}(Z) \cup \mathrm{Bene}(Z) = \emptyset$ then output $\bot$.
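The three cases of the selection rule can be sketched in a few lines. This is an illustrative simplification, not the formal definition: the mutation algorithm and the empirical fitness estimator are passed in as callables (`mutate` and `empirical_fitness`, both hypothetical names), representations are assumed to be hashable, and `None` stands in for $\bot$.

```python
import itertools
import random
from collections import Counter


def sel_nb(r, mutate, empirical_fitness, t, p, rng):
    """One step of a SelNB-style rule: prefer beneficial, then neutral mutations."""
    candidates = [mutate(r, rng) for _ in range(p)]
    freq = Counter(candidates)  # counts are proportional to Pr_Z
    v_r = empirical_fitness(r)
    v = {c: empirical_fitness(c) for c in freq}
    bene = [c for c in freq if v[c] >= v_r + t]
    neut = [c for c in freq if abs(v[c] - v_r) < t]
    pool = bene if bene else neut
    if not pool:
        return None  # no mutation survived
    # Sample from the pool according to relative frequencies among the candidates.
    return rng.choices(pool, weights=[freq[c] for c in pool], k=1)[0]


# Toy setting: representations are integers and fitness is the integer itself.
fitness = lambda c: c
rng = random.Random(0)

steps = itertools.cycle([1, -1])
alternating = lambda r, _rng: r + next(steps)  # proposes r+1 and r-1 in turn
always_worse = lambda r, _rng: r - 1
identity = lambda r, _rng: r

print(sel_nb(3, alternating, fitness, 0.5, 10, rng))   # beneficial mutation wins
print(sel_nb(3, identity, fitness, 0.5, 10, rng))      # only a neutral mutation exists
print(sel_nb(3, always_worse, fitness, 0.5, 10, rng))  # nothing survives
```

Because beneficial mutations take priority whenever at least one is present, neutral mutations only matter in generations where no candidate clears the tolerance threshold.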

A concept class $C$ is said to be evolvable by a mutation algorithm $A$ guided by a selection rule
Sel over distribution $D$ if for every target concept $f \in C$, mutation steps as defined by $A$ and guided
by Sel will converge to $f$.

Definition 5.3 For concept class $C$ over $X$, distribution $D$, mutation algorithm $A$, loss function
$L$ and a selection rule Sel based on LPerf we say that the class $C$ is evolvable over $D$ by $A$ in
Sel if there exists a polynomial $g(n, 1/\epsilon)$ such that for every $n$, $f \in C$, $\epsilon > 0$, and every $r_0 \in R$,
with probability at least $1 - \epsilon$, a sequence $r_0, r_1, r_2, \ldots$, where $r_i \leftarrow \mathrm{Sel}(f, D, A, r_{i-1})$ will have
$\mathrm{LPerf}_f(r_{g(n,1/\epsilon)}, D) > 1 - \epsilon$. We refer to the algorithm obtained as evolutionary algorithm $(A, \mathrm{Sel})$.

We say that an evolutionary algorithm $(A, \mathrm{Sel})$ evolves $C$ over $D$ monotonically if with probability at
least $1 - \epsilon$, for every $i \leq g(n, 1/\epsilon)$, $\mathrm{LPerf}_f(r_i, D) \geq \mathrm{LPerf}_f(r_0, D)$, where $g(n, 1/\epsilon)$ and $r_0, r_1, r_2, \ldots$
are defined as above.

As in PAC learning, we say that a concept class $C$ is evolvable in Sel if it is evolvable over all
distributions by a single evolutionary algorithm (we emphasize this by saying distribution-independently
evolvable). A more relaxed notion of evolvability requires convergence only when the evolution starts
from a single fixed representation $r_0$. Such evolvability is referred to as evolvability with initialization.

5.3 Monotone Distribution-Specific Evolvability from SQ Learning Algorithms

In our earlier work [16] it was shown that every SQ learnable concept class $C$ is evolvable in
$\mathrm{SelNB}[L_Q, t, p, s]$ (that is, the basic selection rule with quadratic loss) for some polynomials $p(n, 1/\epsilon)$
and $s(n, 1/\epsilon)$ and an inverse polynomial $t(n, 1/\epsilon)$. The mutation algorithms obtained in this result
do not require initialization but instead are based on a form of implicit initialization that involves
gradual reduction of fitness to 0 if the process of evolution is not started in some fixed $r_0$. Such
"deliberate" gradual reduction in fitness appears to be somewhat unnatural. Hence we consider the
question of whether it is possible to evolve from any starting representation without the need for
such implicit initialization and fitness decreases in general; in other words, which concept classes are
evolvable monotonically. In this section we show that monotone evolutionary algorithms exist for
every SQ learnable concept class when evolving with respect to a fixed distribution $D$ and using the
quadratic loss function.

The key element of the proof of this result is essentially an observation that the SQ algorithm
that we designed in the proof of Theorem 3.4 can be seen as repeatedly testing a small set of candidate
hypotheses, and choosing one that reduces the $\|\cdot\|_D^2$ distance to the target function. Converting
such an algorithm to an evolutionary algorithm is a rather straightforward process. First we show
that Theorem 3.2 gives a way to compute a neighborhood of every function $\psi$ that always contains
a function with fitness higher than $\psi$ (unless the fitness of $\psi$ is close to the optimum).

Theorem 5.4 Let $C$ be a concept class over $X$ and $D$ be a distribution. If $C$ is efficiently SQ
learnable over $D$ then there exists an algorithm $\mathcal{N}$ that for every $\epsilon > 0$, given $\epsilon$, access to $D$ and a
circuit for $\psi \in \mathcal{F}_1^\infty$ can produce a set of functions $N(\psi, \epsilon)$ such that

1. For every $f \in C$, there exists $\phi \in N(\psi, \epsilon)$ such that

$$\|f - \phi\|_D^2 \leq \max\{\epsilon, \|f - \psi\|_D^2 - \theta(n, 1/\epsilon)\},$$

for an inverse-polynomial $\theta(\cdot,\cdot)$;

2. the size of $N(\psi, \epsilon)$ is polynomial in $n$ and $1/\epsilon$;

3. the circuit size of every function in $N(\psi, \epsilon)$ is (additively) larger than the circuit size of $\psi$ by
at most a polynomial in $n$ and $1/\epsilon$;

4. the running time of $\mathcal{N}$ is polynomial in $n$, $1/\epsilon$ and the circuit size of $\psi$.

Proof: We use Theorem 3.6 to obtain an algorithm $B$ that, given a circuit for $\psi$ and access to $D$,
efficiently constructs a set $G_\psi$ of polynomial size for some inverse polynomial $\gamma(n, 1/\epsilon)$. Let $G_\psi(\epsilon/4)$
be the output of $B$ with its accuracy parameter set to $\epsilon/4$. Now let

$$N(\psi, \epsilon) = \{P_1(\psi + \gamma \cdot g) \mid g \in G_\psi(\epsilon/4)\} \cup \{P_1(\psi - \gamma \cdot g) \mid g \in G_\psi(\epsilon/4)\} \cup \{\mathrm{sign}(\psi)\} .$$

By the properties of $G_\psi(\epsilon/4)$, for every $f \in C$, either there exists a function $g \in G_\psi(\epsilon/4)$ such that
$|\langle f - \psi, g\rangle_D| \geq \gamma(n, 4/\epsilon)$, or $\Pr_D[f \neq \mathrm{sign}(\psi)] \leq \epsilon/4$. In the first case, by the analysis in the
proof of Theorem 3.4, $\psi_g = P_1(\psi + b \cdot \gamma(n, 4/\epsilon) \cdot g)$ satisfies $\|f - \psi_g\|_D^2 \leq \|f - \psi\|_D^2 - \gamma(n, 4/\epsilon)^2$ for
$b = \mathrm{sign}(\langle f - \psi, g\rangle_D)$. In the second case, $\|\mathrm{sign}(\psi) - f\|_D^2 \leq 4 \cdot \epsilon/4 = \epsilon$. Theorem 3.6 also implies
that the algorithm that we have defined satisfies the bounds in conditions (2)-(4). □
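The neighborhood used in this proof is easy to write down once an approximating set is available. The following sketch works on a finite domain and takes the set `G` and the step size `gamma` as given inputs, which is an assumption: in the proof they come from algorithm $B$. The convention $\mathrm{sign}(0) = 1$ is also an assumption made here. The toy instance shows a neighbor that reduces the squared distance to the target by at least $\gamma^2$.

```python
def inner(u, v, weights):
    """<u, v>_D for functions stored as lists of values."""
    return sum(w * a * b for w, a, b in zip(weights, u, v))


def sq_dist(u, v, weights):
    """Squared distance ||u - v||_D^2."""
    diff = [a - b for a, b in zip(u, v)]
    return inner(diff, diff, weights)


def clip(values):
    """P_1: pointwise projection onto [-1, 1]."""
    return [max(-1.0, min(1.0, a)) for a in values]


def neighborhood(psi, G, gamma):
    """Candidates P_1(psi + gamma*g), P_1(psi - gamma*g) for g in G, plus sign(psi)."""
    plus = [clip([a + gamma * b for a, b in zip(psi, g)]) for g in G]
    minus = [clip([a - gamma * b for a, b in zip(psi, g)]) for g in G]
    sign = [1.0 if a >= 0 else -1.0 for a in psi]  # assumed convention sign(0) = 1
    return plus + minus + [sign]


# Toy instance on a two-point domain with uniform weights.
weights = [0.5, 0.5]
f = [1.0, -1.0]
psi = [0.0, 0.0]
G = [[1.0, -1.0], [1.0, 1.0]]
gamma = 0.5

N = neighborhood(psi, G, gamma)
best = min(N, key=lambda c: sq_dist(f, c, weights))
print(best, sq_dist(f, best, weights), sq_dist(f, psi, weights))
```

Including both signs of the update is what removes the need to know $\mathrm{sign}(\langle f - \psi, g\rangle_D)$ in advance; selection by fitness picks the right one.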
An immediate corollary of this result is monotone evolvability of every SQ-learnable concept class 
in SelNB[LQ, t,p, s] over any fixed distribution D. 

Theorem 5.5 Let $D$ be a distribution and $C$ be a concept class efficiently SQ learnable over $D$. There
exist polynomials $p(n, 1/\epsilon)$ and $s(n, 1/\epsilon)$, an inverse polynomial $t(n, 1/\epsilon)$ and a mutation algorithm
$A = (R, M)$ such that $C$ is evolvable monotonically by $A$ over $D$ in $\mathrm{SelNB}[L_Q, t(n, 1/\epsilon), p(n, 1/\epsilon), s(n, 1/\epsilon)]$.
Here if $D$ is not efficiently samplable then $A$ is a non-uniform algorithm.

Proof: Let $R$ be the representation class containing all circuits over $X$ and let $r$ be any representation
in $R$. Given $r$ and $1/\epsilon$ the algorithm $M$ uses the algorithm $\mathcal{N}$ from Theorem 5.4 with parameters
$r$ and $\epsilon$ to obtain $N(r, \epsilon)$. The algorithm $\mathcal{N}$ requires access to distribution $D$ and can be simulated
efficiently if $D$ is efficiently samplable or simulated using a fixed random sample of points from $D$
otherwise (as is done, for example, in [14]). The algorithm $M$ outputs a randomly and uniformly
chosen representation in $N(r, \epsilon)$.

In order for this evolutionary algorithm to work we need to make sure that a representation with
the highest fitness in $N(r, \epsilon)$ is present in the candidate pool and that the fitness of each candidate
mutation is estimated sufficiently accurately. We denote a representation with the highest fitness by
$r^*$. The bound on the number of generations that we are going to prove is $g(n, 1/\epsilon) = 8/\theta(n, 1/\epsilon)$. To ensure
that $r^*$ is with probability at least $1 - \epsilon/4$ in the candidate pool in every generation we set
$p(n, 1/\epsilon) = |N(r, \epsilon)| \cdot \ln(4 \cdot g(n, 1/\epsilon)/\epsilon)$. To ensure that with probability at least $1 - \epsilon/4$ in every generation the fitness
of each mutation is estimated within $\theta(n, 1/\epsilon)/8$ we set $s(n, 1/\epsilon) = c \cdot \theta(n, 1/\epsilon)^{-2} \cdot \log(8 \cdot p(n, 1/\epsilon) \cdot g(n, 1/\epsilon)/\epsilon)$
for a constant $c$ (obtained via Hoeffding's bound). We set the tolerance of the selection rule to
$t(n, 1/\epsilon) = 3 \cdot \theta(n, 1/\epsilon)/8$.

By the properties of $\mathcal{N}$,

$$L_Q\mathrm{Perf}_f(r^*, D) \geq \min\{L_Q\mathrm{Perf}_f(r, D) + \theta(n, 1/\epsilon)/2, 1 - \epsilon/2\} .$$

If $L_Q\mathrm{Perf}_f(r, D) < 1 - \epsilon$ then $L_Q\mathrm{Perf}_f(r^*, D) \geq L_Q\mathrm{Perf}_f(r, D) + \theta(n, 1/\epsilon)/2$ (without loss of
generality $\theta(n, 1/\epsilon) \leq \epsilon$). In this case if $r^*$ is in the pool of candidates $Z$ and the empirical fitness
of every mutation in $Z$ is within $\theta(n, 1/\epsilon)/8$ of the true fitness then $\mathrm{Bene}_Z(r)$ is non-empty and for
every $r' \in \mathrm{Bene}_Z(r)$, $L_Q\mathrm{Perf}_f(r', D) \geq L_Q\mathrm{Perf}_f(r, D) + \theta(n, 1/\epsilon)/4$. In particular, the output of
$\mathrm{SelNB}[L_Q, t(n, 1/\epsilon), p(n, 1/\epsilon), s(n, 1/\epsilon)]$ will have fitness at least $L_Q\mathrm{Perf}_f(r, D) + \theta(n, 1/\epsilon)/4$. The
lowest initial fitness is $-1$ and therefore, with probability at least $1 - \epsilon/2$, after at most $g(n, 1/\epsilon) = 8/\theta(n, 1/\epsilon)$
steps a representation with fitness at least $1 - \epsilon$ will be reached.

We also need to establish that once the fitness of at least $1 - \epsilon$ is reached it does not decrease
within $g(n, 1/\epsilon)$ steps and also prove that the evolution algorithm is monotone. To ensure this we
modify slightly the mutation algorithm $M$. The algorithm $M'$ outputs a randomly and uniformly
chosen representation in $N(r, \epsilon)$ with probability $\Delta = \epsilon/(2 \cdot g(n, 1/\epsilon))$ and outputs $r$ with probability
$1 - \Delta$. We also increase $p(n, 1/\epsilon)$ accordingly to ensure that $r^*$ is still in the pool of candidates
with sufficiently high probability. This change does not influence the analysis when $\mathrm{Bene}_Z(r)$ is
non-empty. If $\mathrm{Bene}_Z(r)$ is empty then, by the definition of $M'$, $\mathrm{SelNB}[L_Q, t(n, 1/\epsilon), p(n, 1/\epsilon), s(n, 1/\epsilon)]$
will output $r$ with probability at least $1 - \Delta$. That is, in every step, either the fitness improves or it
does not change with probability at least $1 - \Delta$. In particular, with probability at least $1 - \epsilon/2$ the
fitness will not decrease during any of the first $g(n, 1/\epsilon)$ generations. □
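As a rough illustration of how the parameters in this proof scale, the sketch below computes them from $\theta$, the neighborhood size and $\epsilon$. It assumes the following reading of the settings: $g = 8/\theta$ generations, pool size $p = |N| \cdot \ln(4g/\epsilon)$, sample size $s = c \cdot \theta^{-2} \cdot \log(8pg/\epsilon)$ and tolerance $t = 3\theta/8$. The constant `c` is left unspecified in the text, so the default used here is a placeholder, and the function name is hypothetical.

```python
import math


def selection_parameters(theta, neighborhood_size, eps, c=1.0):
    """Parameters for the selection rule under the assumed formulas (c is a placeholder)."""
    g = 8 / theta                                          # bound on generations
    p = neighborhood_size * math.log(4 * g / eps)          # candidate pool size
    s = c * theta ** -2 * math.log(8 * p * g / eps)        # samples per fitness estimate
    t = 3 * theta / 8                                      # selection tolerance
    return {
        "generations": g,
        "pool": math.ceil(p),
        "sample": math.ceil(s),
        "tolerance": t,
    }


params = selection_parameters(theta=0.1, neighborhood_size=10, eps=0.1)
print(params)

# Probability that a fixed member of the neighborhood is missed by a uniform pool.
miss = (1 - 1 / 10) ** params["pool"]
print(miss, 0.1 / (4 * params["generations"]))
```

With a pool of this size, the chance of missing the best neighbor in one generation is at most $\epsilon/(4g)$, so a union bound over all $g$ generations gives the $\epsilon/4$ failure probability used in the proof.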

5.4 Distribution-Independent Evolvability of Disjunctions 

A substantial limitation of the general transformation given in the previous section is that the
mutation algorithm given there requires access to $D$ and hence only implies evolvability for a fixed
distribution. In this section we show that for the concept class of disjunctions (and conjunctions) the
ideas of the transformation in Section 5.3 can be used to derive a simple algorithm for
distribution-independent monotone evolvability of disjunctions.

As usual in distribution-independent learning, we can assume that the disjunction is monotone. 
We represent a monotone disjunction by a subset $T \subseteq [n]$ containing the indices of the variables in 
the disjunction and refer to it as $t_T$. In addition, we define $\theta_T = (t_T + 1)/2$ and, for every $i \in [n]$, let 
$\theta_i(x) = (1 + x_i)/2$ ($\theta_T$ and $\theta_i$ are simply the $\{0,1\}$ versions of $t_T$ and $x_i$). Given a current hypothesis 
computing function $\phi \in \mathcal{F}_1^\infty$ we try to modify it in two ways. The first one is to add $\gamma \cdot \theta_i(x)$ and 
project using $P_1$ for some $i \in [n]$ and $\gamma > 0$. The other one is to subtract $\gamma$ and project using $P_1$. The 
purpose of the first type of modification is to increase fitness on points where the target disjunction 
equals 1. It is easy to see that such steps can make the fitness on such points as close to 1 as 
desired. The problem with such steps is that they might also add $\gamma \cdot \theta_i$ such that $x_i$ is not in the 
target disjunction and thereby decrease the fitness on points where the target equals $-1$. We fix this 
by using the second type of modification. This modification increases the fitness on points where the 
target equals $-1$ but may decrease the fitness on points where the target equals 1. The reason why 
this combination of modifications will converge to a good hypothesis is that for the quadratic loss 
function the change in loss due to an update is larger on points where the loss is larger. Namely, 
$L_Q(y, y' + \Delta) = L_Q(y, y') + 2 \cdot \Delta \cdot (y' - y) + \Delta^2$. This means that if the first type of modification 
can no longer improve fitness then the second type will. We formalize this argument in the lemma 
below. 
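
To make the effect of the two modifications concrete, here is a minimal Python sketch (it is not part of the original text). It assumes points lie in $\{-1,1\}^n$, represents the value of a hypothesis at a single point as a float, and uses the hypothetical helper names `quad_loss`, `project`, `add_step`, `subtract_step` and `loss_change`. It illustrates that subtracting $\gamma$ reduces the quadratic loss substantially where the prediction is far from a target of $-1$ and increases it only slightly where the prediction is already close to a target of $1$.

```python
def quad_loss(y, y_pred):
    """Quadratic loss L_Q(y, y') = (y' - y)^2."""
    return (y_pred - y) ** 2


def project(a):
    """P_1: clip a real value to the interval [-1, 1]."""
    return max(-1.0, min(1.0, a))


def add_step(phi_x, x_i, gamma):
    """First modification at one point: P_1(phi + gamma * theta_i),
    where theta_i = (1 + x_i) / 2 and x_i is assumed to be -1 or 1."""
    return project(phi_x + gamma * (1 + x_i) / 2)


def subtract_step(phi_x, gamma):
    """Second modification at one point: P_1(phi - gamma)."""
    return project(phi_x - gamma)


def loss_change(y, y_pred, delta):
    """Change in quadratic loss when the prediction moves from y' to y' + delta.
    Algebraically this equals 2 * delta * (y' - y) + delta^2."""
    return quad_loss(y, y_pred + delta) - quad_loss(y, y_pred)


gamma = 0.1  # illustrative step size, not the value used in the lemma
# Target -1, prediction far from it: subtracting gamma lowers the loss a lot.
far = loss_change(-1.0, 0.5, -gamma)
# Target 1, prediction close to it: subtracting gamma raises the loss a little.
near = loss_change(1.0, 0.95, -gamma)
```

With these illustrative numbers the loss falls by about 0.29 at the first point and rises by about 0.02 at the second, which is the asymmetry the argument relies on.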

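Lemma 5.6 For $\phi(x) \in \mathcal{F}_1^\infty$, let $N_\gamma(\phi) = \{P_1(\phi + \gamma \cdot \theta_i) \mid i \in [n]\} \cup \{\phi, P_1(\phi - \gamma)\}$. There exist 
inverse polynomials $\tau(\cdot,\cdot)$ and $\gamma(\cdot,\cdot)$ such that for every distribution $D$ over $\{0,1\}^n$, every target 
monotone disjunction $f$, every $\epsilon > 0$ and every $\phi(x) \in \mathcal{F}_1^\infty$ there exists $\phi' \in N_{\gamma(n,1/\epsilon)}(\phi)$ for which 

\[ L_Q\mathrm{Perf}_f(\phi', D) \ge \min\{L_Q\mathrm{Perf}_f(\phi, D) + \tau(n, 1/\epsilon),\ 1 - \epsilon\} . \]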

Proof: Let $f = t_T$ denote the target monotone disjunction. By the definition, $\|f - \phi\|_D^2 = 2(1 - 
L_Q\mathrm{Perf}_f(\phi, D))$. We denote the loss of $\phi$ when $f$ is restricted to 1 and $-1$ by $\Delta_1 = \mathbf{E}_D[(f-\phi)^2 \cdot (f+1)/2]$ 
and $\Delta_{-1} = \mathbf{E}_D[(f - \phi)^2 \cdot (1 - f)/2]$ respectively. Let $\gamma = \epsilon^{3/2}/21$ and $\tau = \gamma^4/(8n)$. We split the 
analysis into several cases. 

1. If $L_Q\mathrm{Perf}_f(\phi, D) \ge 1 - \epsilon$ then $\phi' = \phi$ satisfies the condition. 

2. $L_Q\mathrm{Perf}_f(\phi, D) < 1 - \epsilon$ and $\Delta_1 \ge 2\gamma^2$. In this case, 

\[ \Delta_1 \le \Pr_D[f(x) = 1, \phi(x) \ge 1 - \gamma] \cdot \gamma^2 + \Pr_D[f(x) = 1, \phi(x) < 1 - \gamma] \cdot 4 . \]

Therefore 

\[ \Pr_D[f(x) = 1, \phi(x) < 1 - \gamma] \ge (\Delta_1 - \gamma^2)/4 \ge \gamma^2/4 . \]

The target function is a disjunction of at most $n$ variables, therefore there exists $i \in T$ such 
that $\Pr_D[x_i = 1, \phi(x) < 1 - \gamma] \ge \gamma^2/(4n)$. For such $i$, let $\phi' = P_1(\phi + \gamma \cdot \theta_i)$. Note that for 
every point $x$, the loss of $\phi'(x)$ is at most the loss of $\phi(x)$, while for every point where $x_i = 1$ 
and $\phi(x) < 1 - \gamma$ the loss of $\phi'(x)$ is smaller than the loss of $\phi(x)$ by at least $\gamma^2$. Therefore, 

\[ \|f - \phi'\|_D^2 \le \|f - \phi\|_D^2 - \gamma^2 \cdot \Pr_D[x_i = 1, \phi(x) < 1 - \gamma] \le \|f - \phi\|_D^2 - \frac{\gamma^4}{4n} . \]

This implies that 

\[ L_Q\mathrm{Perf}_f(\phi', D) \ge L_Q\mathrm{Perf}_f(\phi, D) + \tau(n, 1/\epsilon) \]

for $\tau$ defined as above. 

3. $L_Q\mathrm{Perf}_f(\phi, D) < 1 - \epsilon$ and $\Delta_1 < 2\gamma^2$. In this case $\Delta_{-1} \ge 2\epsilon - \Delta_1 \ge 3\epsilon/2$. Let $\phi' = P_1(\phi - \gamma)$. 
We now upper bound the increase in error on points where $f = 1$ and lower bound the decrease 
in error on points where $f = -1$. For the upper bound we have 

\[ \mathbf{E}_D[(f - \phi + \gamma)^2] \le 2 \cdot \mathbf{E}_D[(f - \phi)^2] + 2 \cdot \gamma^2 , \]

and therefore the increase in error when $f = 1$ is at most $\Delta_1 + 2 \cdot \gamma^2 \le 4 \cdot \gamma^2$. For the lower 
bound, similarly to the previous case, we get the inequality 

\[ \Delta_{-1} \le \Pr_D[f(x) = -1, \phi(x) < -1 + \sqrt{\epsilon/2}] \cdot \epsilon/2 + \Pr_D[f(x) = -1, \phi(x) \ge -1 + \sqrt{\epsilon/2}] \cdot 4 . \]

Therefore 

\[ \Pr_D[f(x) = -1, \phi(x) \ge -1 + \sqrt{\epsilon/2}] \ge (\Delta_{-1} - \epsilon/2)/4 \ge \epsilon/4 . \tag{1} \]

On every point $x$ where $f(x) = -1$ and $\phi(x) \ge -1 + \sqrt{\epsilon/2}$, 

\[ |f(x) - \phi'(x)|^2 \le |f(x) - \phi(x)|^2 - (2\gamma(\phi(x) - f(x)) - \gamma^2) \le |f(x) - \phi(x)|^2 - 2\gamma\sqrt{\epsilon/2} + \gamma^2 . \]

By combining this with equation (1) and our choice of $\gamma = \epsilon^{3/2}/21$ we get, for the contribution of the points where $f = -1$, 

\[ \|f - \phi'\|_D^2 \le \|f - \phi\|_D^2 - \frac{\epsilon}{4}\left(2\gamma\sqrt{\epsilon/2} - \gamma^2\right) \le \|f - \phi\|_D^2 - 5 \cdot \gamma^2 . \]

Therefore in this case 

\[ L_Q\mathrm{Perf}_f(\phi', D) \ge L_Q\mathrm{Perf}_f(\phi, D) + (5 \cdot \gamma^2 - 4 \cdot \gamma^2)/2 \ge L_Q\mathrm{Perf}_f(\phi, D) + \tau(n, 1/\epsilon) . \]

□ 
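
The per-point bounds and the constant used in cases 2 and 3 follow from expanding the square. The short derivation below is a reader's check rather than part of the original proof; it assumes $0 < \epsilon \le 1$, so that $\gamma = \epsilon^{3/2}/21 \le \sqrt{\epsilon/2}$ and the projection $P_1$ does not clip at the points considered.

```latex
% Case 2: f(x) = 1, x_i = 1 and \phi(x) < 1 - \gamma, so \phi'(x) = \phi(x) + \gamma.
\begin{align*}
(1 - \phi(x))^2 - (1 - \phi(x) - \gamma)^2
  &= 2\gamma\,(1 - \phi(x)) - \gamma^2 \\
  &> 2\gamma \cdot \gamma - \gamma^2 = \gamma^2 .
\end{align*}

% Case 3: f(x) = -1 and \phi(x) \ge -1 + \sqrt{\epsilon/2}, so \phi'(x) = \phi(x) - \gamma.
\begin{align*}
(\phi(x) + 1)^2 - (\phi(x) - \gamma + 1)^2
  &= 2\gamma\,(\phi(x) + 1) - \gamma^2 \\
  &\ge 2\gamma\sqrt{\epsilon/2} - \gamma^2 .
\end{align*}

% Constant in case 3, using \epsilon^{3/2} = 21\gamma and \epsilon \le 1.
\begin{align*}
\frac{\epsilon}{4}\left(2\gamma\sqrt{\epsilon/2} - \gamma^2\right)
  &= \frac{\sqrt{2}}{4}\,\gamma\,\epsilon^{3/2} - \frac{\epsilon}{4}\,\gamma^2
   = \frac{21\sqrt{2}}{4}\,\gamma^2 - \frac{\epsilon}{4}\,\gamma^2 \\
  &\ge \left(\frac{21\sqrt{2}}{4} - \frac{1}{4}\right)\gamma^2 \ge 5\gamma^2 .
\end{align*}
```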

The neighborhood $N_\gamma(\phi)$ can be computed efficiently and therefore Lemma 5.6 can be converted 
to an evolutionary algorithm in exactly the same way as it was done in Theorem 5.5. This implies 
monotone and distribution-independent evolvability of disjunctions in $\mathrm{SelNB}[L_Q, t, p, s]$. 
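
As an illustration of how the neighborhood drives the search, the following Python sketch runs a simplified, deterministic stand-in for the evolutionary process. It is not the selection rule SelNB: it assumes the uniform distribution over $\{-1,1\}^n$, computes performance exactly instead of estimating it from samples, and always moves to the best neighbor. The step size `gamma=0.05` is chosen for a short run and is not the value $\epsilon^{3/2}/21$ from the lemma. All function names are hypothetical.

```python
from itertools import product


def project(a):
    """P_1: clip a real value to [-1, 1]."""
    return max(-1.0, min(1.0, a))


def lq_perf(phi, target, points):
    """Quadratic-loss performance 1 - E[(f - phi)^2] / 2 under the uniform distribution."""
    err = sum((target(x) - phi[x]) ** 2 for x in points) / len(points)
    return 1.0 - err / 2.0


def neighborhood(phi, points, n, gamma):
    """N_gamma(phi) = {P_1(phi + gamma * theta_i)} union {phi, P_1(phi - gamma)}."""
    nbrs = [dict(phi)]
    for i in range(n):
        # theta_i(x) = (1 + x_i) / 2 is 1 when x_i = 1 and 0 when x_i = -1.
        nbrs.append({x: project(phi[x] + gamma * (1 + x[i]) / 2) for x in points})
    nbrs.append({x: project(phi[x] - gamma) for x in points})
    return nbrs


def evolve(target, n, gamma, eps, max_generations=10000):
    """Greedy search: move to the best neighbor until performance reaches 1 - eps."""
    points = list(product((-1, 1), repeat=n))
    phi = {x: 0.0 for x in points}  # hypothesis stored as a table of values
    history = [lq_perf(phi, target, points)]
    for _ in range(max_generations):
        if history[-1] >= 1 - eps:
            break
        best = max(neighborhood(phi, points, n, gamma),
                   key=lambda h: lq_perf(h, target, points))
        best_perf = lq_perf(best, target, points)
        if best_perf <= history[-1]:
            break  # no neighbor improves the performance
        phi = best
        history.append(best_perf)
    return phi, history


def disjunction(T):
    """Monotone disjunction over the index set T, with values in {-1, 1}."""
    return lambda x: 1 if any(x[i] == 1 for i in T) else -1


phi, history = evolve(disjunction({0, 1}), n=3, gamma=0.05, eps=0.1)
```

Because the current hypothesis is itself in the neighborhood, the recorded performance never decreases, which mirrors the monotonicity claimed in the text. The table representation is exponential in $n$ and is used here only to keep the example exact and small.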

Theorem 5.7 There exist polynomials $p(n, 1/\epsilon)$ and $s(n, 1/\epsilon)$, an inverse polynomial $t(n, 1/\epsilon)$ and 
a mutation algorithm $A = (R, M)$ such that for every distribution $D$ disjunctions are evolvable 
monotonically by $A$ over $D$ in $\mathrm{SelNB}[L_Q, t(n, 1/\epsilon), p(n, 1/\epsilon), s(n, 1/\epsilon)]$. 

6 Discussion and Further Work 

One natural question not covered in this work is whether and how our characterization can be applied 
to understanding the SQ complexity of learning specific concept classes for which the previously 
known characterizations are not sufficient. As we explained in the introduction, one such example 
is learning of monotone functions. This question is addressed in a recent work [18], where the first 
lower bounds for SQ learning of depth-3 monotone formulas over the uniform distribution are derived 
using SQ-SDIM. The main open problem in this direction is evaluating the SQ-SDIM of monotone 
DNF over the uniform distribution. 

As we have mentioned, another way to see our proof of Theorem 3.4 is as a boosting algorithm that, 
instead of using a weak learning algorithm on different distributions, uses a weak learning algorithm 
on different target functions (specifically on $f - \psi_i$ at iteration $i$). This perspective turned out to be 
useful for understanding boosting in the agnostic learning framework. In particular, it has led to 
the distribution-specific boosting algorithm given in Theorem 4.3 and to a new connection between 
agnostic and PAC learning. 

We also believe that the insights into the structure of SQ learning given in this work will be 
useful in further exploration of Valiant's model of evolvability. For example, Theorem 5.4 can also 
be used to obtain distribution-specific evolvability of every SQ-learnable concept class with only very 
weak assumptions on the selection rule (such as $(t, \gamma)$-distinguishing defined in [16]). In addition, we 
believe that our results are not restricted to the quadratic loss function and can be applied to 
evolvability with other loss functions. Perhaps the most interesting question in this direction is whether 
results analogous to Theorem 5.3 can also be obtained for distribution-independent evolvability. The 
distribution-independent evolvability of disjunctions given in Theorem 5.7 suggests that the answer 
might be "yes" for many other interesting concept classes. We hope that these questions will be 
investigated in subsequent work. 